Abstract


Avoiding  Communication  in  Dense  Linear  Algebra 

by 

Grey  Malone  Ballard 

Doctor  of  Philosophy  in  Computer  Science 
with  a  Designated  Emphasis  in  Computational  Science  and  Engineering 
University  of  California,  Berkeley 
Professor  James  Demmel,  Chair 


Dense  linear  algebra  computations  are  essential  to  nearly  every  problem  in  scientific 
computing and to countless other fields. Most matrix computations enjoy a high computational
intensity (i.e., ratio of computation to data), and therefore the algorithms for the
computations  have  a  potential  for  high  efficiency.  However,  performance  for  many  linear 
algebra  algorithms  is  limited  by  the  cost  of  moving  data  between  processors  on  a  parallel 
computer  or  throughout  the  memory  hierarchy  of  a  single  processor,  which  we  will  refer  to 
generally  as  communication.  Technological  trends  indicate  that  algorithmic  performance  will 
become  even  more  limited  by  communication  in  the  future.  In  this  thesis,  we  consider  the 
fundamental  computations  within  dense  linear  algebra  and  address  the  following  question: 
can  we  significantly  improve  the  current  algorithms  for  these  computations,  in  terms  of  the 
communication  they  require  and  their  performance  in  practice? 

To  answer  the  question,  we  analyze  algorithms  on  sequential  and  parallel  architectural 
models  that  are  simple  enough  to  determine  coarse  communication  costs  but  accurate  enough 
to  predict  performance  of  implementations  on  real  hardware.  For  most  of  the  computations, 
we  prove  lower  bounds  on  the  communication  that  any  algorithm  must  perform.  If  an 
algorithm  exists  with  communication  costs  that  match  the  lower  bounds  (at  least  in  an 
asymptotic  sense),  we  call  the  algorithm  communication  optimal.  In  many  cases,  the  most 
commonly used algorithms are not communication optimal, and we can develop new algorithms
that require less data movement and attain the communication lower bounds.

In  this  thesis,  we  develop  both  new  communication  lower  bounds  and  new  algorithms, 
tightening (and in many cases closing) the gap between the best known lower bound and the best
known algorithm (or upper bound). We consider both sequential and parallel algorithms, and
we assess both classical and fast algorithms (e.g., Strassen’s matrix multiplication algorithm).
In  particular,  the  central  contributions  of  this  thesis  are 

•  proving  new  communication  lower  bounds  for  nearly  all  classical  direct  linear  algebra 
computations  (dense  or  sparse),  including  factorizations  for  solving  linear  systems, 
least squares problems, and eigenvalue and singular value problems,

• proving new communication lower bounds for Strassen’s and other fast matrix multiplication
algorithms,

•  proving  new  parallel  communication  lower  bounds  for  classical  and  fast  computations 
that  set  limits  on  an  algorithm’s  ability  to  perfectly  strong  scale, 

•  summarizing  the  state-of-the-art  in  communication  efficiency  for  both  sequential  and 
parallel  algorithms  for  the  computations  to  which  the  lower  bounds  apply, 

• developing a new communication-optimal algorithm for computing a symmetric-indefinite
factorization (observing speedups of up to 2.8× compared to alternative shared-memory
parallel algorithms),

•  developing  new,  more  communication-efficient  algorithms  for  reducing  a  symmetric 
band matrix to tridiagonal form via orthogonal similarity transformations (observing
speedups of 2-6× compared to alternative sequential and parallel algorithms), and

• developing a new communication-optimal parallelization of Strassen’s matrix multiplication
algorithm (observing speedups of up to 2.84× compared to alternative
distributed-memory parallel algorithms).
Chapter 1
Introduction 


1.1  The  Role  of  Scientific  Computing 

Numerical  simulation  has  emerged  as  the  “third  pillar”  of  science  along  with  theory  and 
experimentation  [33] .  Because  of  the  availability  of  increasingly  powerful  computers  and  the 
rich  development  in  applied  mathematical  modeling  and  algorithmic  efficiency,  scientists  are 
able  to  simulate  physical  phenomena  that  would  otherwise  be  too  expensive,  too  dangerous, 
too  time-consuming,  or  simply  impossible  to  observe.  As  computing  facilities  grow,  the  size 
and  complexity  of  the  problems  scientists  are  able  to  tackle  continue  to  increase.  The  field  of 
scientific  computing  aims  to  enable  domain-specific  computational  scientists  to  develop  and 
test  their  ideas  (through  simulation,  data  analysis,  or  other  means)  by  providing  software 
that  produces  results  both  quickly  and  accurately. 

While  the  diversity  of  scientific  domains  that  depend  on  efficient  computation  is  vast, 
from the computer science point of view, the number of different types of fundamental computational
patterns required to support all these domains is quite small. Many applications
that  are  very  different  from  each  other  can  be  mathematically  modeled  in  the  same  way. 
For  example,  simulating  the  behavior  of  a  small  group  of  electrons  to  high  accuracy  (using 
“coupled  cluster”  theory  from  computational  chemistry)  and  statistically  grouping  variables 
within  large  social  science  data  sets  (using  “principal  component  analysis”  from  statistics) 
are  two  applications  whose  main  computational  requirements  are  strikingly  similar.  In  fact, 
one  classification  of  fundamental  computational  patterns  identifies  seven  different  patterns 
(coined “the seven dwarves”) that cover nearly all domains of computational science [52].
One  of  the  dwarves  is  dense  linear  algebra,  the  topic  of  this  thesis. 


1.2  The  Importance  of  Dense  Linear  Algebra 

Dense  linear  algebra  refers  to  matrix  computations  where  the  matrices  involved  do  not 
have  many  zero  entries.  Common  matrix  computations  include  solving  linear  systems  of 
equations,  solving  “least  squares”  problems,  and  computing  eigenvalue  and  singular  value 
decompositions. Dense linear algebra is an especially important dwarf for two reasons: (1)
a  large  fraction  of  applications  require  some  dense  linear  algebra  computation,  even  if  most 
of  the  time  is  spent  on  other  dwarves,  and  (2)  dense  matrix  computations  have  consistently 
performed  well,  even  on  emerging  architectures.  For  example,  based  on  the  2010  computing 
resource  allocation  requests  of  the  National  Energy  Research  Scientific  Computing  Center  (a 
division  of  Lawrence  Berkeley  National  Laboratory),  out  of  a  total  of  427  projects,  204  use 
LAPACK  [8]  (ranked  first  of  all  requested  libraries)  and  113  use  ScaLAPACK  [44]  (ranked 
third), both of which are dense linear algebra libraries. These libraries obtain
relatively high performance because the computations they implement are regular (and
have therefore been heavily optimized over many years) and have high arithmetic intensity
(a high ratio of computation to data size).

Over the past few decades, there have been many advances in dense linear algebra software.
The LAPACK library, which includes the Basic Linear Algebra Subroutines (BLAS)
[45], was developed in the 1980s and targeted single processor machines (and vector machines
of the time). In the 1990s, the Scalable LAPACK (ScaLAPACK) library extended
the functionality to distributed-memory parallel computers, where processors communicate
over a network. Today, these software packages provide reference implementations, but
many other, more-optimized versions of the libraries exist. Some of these libraries are developed
and supported by hardware vendors, like Intel’s Math Kernel Library (MKL) [93],
IBM’s Engineering and Scientific Subroutine Library (ESSL) [91], Cray’s LibSci [53], and
NVIDIA’s CUDA BLAS (CUBLAS) [117], and some are open source, like the Automatically
Tuned Linear Algebra Software (ATLAS) project [149]. There are also open-source libraries
targeting emerging parallel architectures like the Parallel Linear Algebra for Scalable Multicore
Architectures (PLASMA) [6] and Matrix Algebra on GPU and Multicore Architectures
(MAGMA) [143] libraries, as well as the BLIS, libFLAME, and Elemental libraries developed
within the FLAME project [81]. This range of available software demonstrates both the
demand for dense linear algebra computation and the variety of techniques for attaining
high performance.


1.3  The  Rise  of  Parallelism  and  the  Relative  Costs  of 
Communication 

In  addition  to  the  importance  of  dense  linear  algebra,  a  major  reason  for  the  long  list  of  linear 
algebra  packages  is  the  diversity  of  today’s  machines.  In  particular,  because  of  the  power, 
memory,  and  instruction-level-parallelism  walls  that  began  to  hit  the  hardware  industry 
in  2004  [9],  in  order  to  improve  processor  performance,  vendors  were  forced  to  introduce 
parallelism  into  mainstream  computer  architectures.  Parallelism  was  introduced  in  the  form 
of chips with multiple cores (“multicore”) and with wider vector or Single Instruction Multiple
Data (SIMD) lanes, as in general-purpose Graphics Processing Units (GPUs). Processor
manufacturers  continue  to  adapt  to  this  new  paradigm  and  use  various  techniques  to  handle 
the complications that arise. As a result, parallelism is both ubiquitous and heterogeneous.

Even before 2004, another hardware trend began to affect the performance of dense
linear algebra computations. Despite the fact that many dense matrix computations enjoy a high
computational  intensity — i.e.,  every  entry  of  the  matrix,  also  referred  to  as  a  word  of  data, 
is  involved  in  many  arithmetic  or  floating  point  operations  (flops) — the  cost  associated  with 
accessing  data  from  memory  to  perform  computations  has  become  non-negligible.  This 
cost  of  data  movement,  or  communication,  is  higher  than  that  of  performing  a  flop,  and 
more  importantly,  this  gap  has  been  growing  exponentially  over  time.  Both  costs  have 
been  improving,  but  the  time  to  perform  a  flop  has  been  improving  by  about  59%  per  year 
while  the  rate  at  which  words  can  be  accessed  from  memory  (i.e.,  memory  bandwidth)  has 
improved  by  about  23%  per  year  [76].  Network  improvements  have  followed  a  similar  pattern, 
with  the  relative  costs  of  communication  growing  exponentially  also  for  distributed-memory 
machines.  Thus,  not  only  is  communication  becoming  more  important  due  to  the  rise  in 
parallelism,  the  relative  costs  of  communication  are  also  increasing. 
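
To make the growth of this gap concrete, the following minimal sketch compounds the two annual improvement rates quoted above. It assumes, for illustration only, that both rates stay constant over time and that "improving by x% per year" means the corresponding rate is multiplied by (1 + x/100) each year; the function name and its parameters are hypothetical.

```python
def flop_to_bandwidth_gap(years, flop_rate=0.59, bandwidth_rate=0.23):
    """Factor by which flop throughput outgrows memory bandwidth.

    Assumes constant annual improvement rates (an idealization of the
    trend figures quoted in the text, not measured data).
    """
    return ((1.0 + flop_rate) / (1.0 + bandwidth_rate)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        gap = flop_to_bandwidth_gap(years)
        print(f"after {years:2d} years the gap has grown by {gap:.1f}x")
```

Under these assumptions the gap widens by roughly 29% per year, which compounds to about an order of magnitude per decade.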

1.4  Thesis  Goals  and  Contributions 

The  focus  of  this  thesis  is  to  develop  a  systematic  approach  for  designing  and  analyzing  dense 
linear  algebra  algorithms,  paying  particular  attention  to  avoiding  communication  costs  as 
much  as  possible,  in  order  to  optimize  performance  on  a  wide  range  of  platforms.  To  maintain 
general  applicability  of  algorithmic  ideas,  we  will  consider  one  sequential  and  one  parallel 
architectural  model  that  are  simple  enough  to  facilitate  coarse  asymptotic  communication 
analysis  of  algorithms  but  accurate  enough  to  predict  performance  of  implementations  on 
real  hardware. 

In many cases, we can prove lower bounds on the communication required of a computation
on a particular machine model, thereby setting a target for algorithmic performance.
Establishing lower bounds and developing and improving algorithms is often a simultaneous
process, with the objective of identifying algorithms that are provably communication
optimal. After an efficient algorithm has been developed for the modeled machine, we will
rely on automatic performance tuning, or autotuning, to tweak the parameters of the algorithm
for optimal performance on a particular platform. The ultimate goal of this work is
to  deliver  the  improved  performance  of  the  new  and  improved  algorithms  to  scientists  across 
many  disciplines  and  other  users  by  integrating  the  ideas  into  the  state-of-the-art  libraries. 

To  this  end,  the  goals  of  this  thesis  are  to 

(1) prove lower bounds on the communication required by matrix computations, thereby
setting targets for algorithmic performance;

(2) survey the current standard algorithms and their communication costs and identify possibilities
for improvement; and
(3) present new algorithms for avoiding communication and demonstrate their impact on
asymptotic  costs  and  actual  performance. 

In  particular,  the  central  contributions  of  this  work  are  that  we 

•  prove  new  communication  lower  bounds  for  LU  and  Cholesky  decompositions  using 
reductions  from  matrix  multiplication; 

• extend an existing communication lower bound for matrix multiplication [95] to “three-nested-loops”
computations for dense or sparse matrices on sequential or parallel machines,
which include BLAS computations, one-sided factorizations like LU, Cholesky,
$LDL^T$, and QR, two-sided factorizations for eigenvalue and singular value decompositions,
and some computations outside of numerical linear algebra, like computing
all-pairs shortest paths of a graph;

• prove new communication lower bounds for Strassen’s [139] and “Strassen-like” fast
matrix multiplication algorithms by analyzing the edge expansion of their computation
graphs;

• prove new communication lower bounds for parallel algorithms (both classical and
Strassen-like) that are independent of the local memory size and impose limits on the
possibility of perfect strong scaling for the algorithms;

• summarize the asymptotic communication costs of the current state-of-the-art sequential
algorithms (including our recent contributions, some of which do not appear in
this thesis) for numerical linear algebra, in terms of words, messages, and cache-obliviousness,
and discuss their performance in practice;

• summarize the asymptotic communication costs of the current state-of-the-art parallel
algorithms (including our recent contributions, some of which do not appear in
this  thesis)  for  numerical  linear  algebra,  in  terms  of  words,  messages,  and  memory 
requirements,  and  discuss  their  performance  in  practice; 

• introduce the first communication-optimal sequential factorization of symmetric indefinite
matrices based on Aasen’s algorithm [1], prove its backward stability (subject
to a growth factor), and present numerical experiments measuring numerical stability
(observing speedups of up to 2.8× compared to alternative shared-memory parallel
algorithms [15]);

•  present  a  set  of  improvements  and  new  sequential  and  parallel  algorithms  based  on 
successive  band  reduction  that  asymptotically  reduce  communication  in  computing  the 
eigendecomposition of symmetric band matrices (observing speedups of 2-6× compared
to  alternative  sequential  and  parallel  algorithms  [31]);  and 
• describe a communication-optimal parallelization of Strassen’s matrix multiplication
algorithm  that  is  superior  to  all  other  matrix  multiplication  algorithms,  both  in  terms 
of  its  asymptotic  communication  costs  and  its  performance  in  practice  (observing 
speedups of up to 2.84× compared to alternative distributed-memory parallel algorithms
[20]).


1.5  Thesis  Organization 

The  content  of  this  thesis  is  divided  between  communication  lower  bounds  and  algorithms. 
Chapter 2 presents preliminary ideas that will be useful throughout the thesis, Part I (Chapters
3-6) presents communication lower bound results, and Part II (Chapters 7-11) discusses
algorithms  and  their  analyses.  We  conclude  and  discuss  future  directions  in  Chapter  12. 

Of  the  chapters  in  Part  I,  Chapter  3  discusses  known  lower  bounds  for  classical  matrix 
multiplication and how to extend those results to other computations using reduction arguments.
Chapter 4 establishes general theorems which can be applied to most “classical”
linear  algebra  algorithms,  and  Chapter  5  proves  communication  lower  bounds  for  Strassen’s 
matrix  multiplication  algorithm.  In  Chapter  6  we  discuss  extensions  and  further  applications 
of  the  lower  bound  techniques  and  results. 

The chapters in Part II are organized as follows. Chapters 7 and 8 summarize the
most  communication-efficient  algorithms  on  sequential  and  parallel  machines,  respectively. 
Chapters 9-11 discuss new communication-avoiding algorithms for three different computations:
symmetric indefinite matrix factorization (Chapter 9), orthogonal tridiagonalization
of  a  symmetric  band  matrix  (Chapter  10),  and  parallelizing  Strassen’s  matrix  multiplication 
algorithm  (Chapter  11). 
Chapter 2
Preliminaries 


2.1  Notation  and  Definitions 

In  this  section  we  define  general  notation  and  terminology  used  throughout  the  thesis.  The 
most  common  parameters  considered  include  n,  the  matrix  dimension  of  a  square  matrix; 
M,  the  size  of  local  memory  available  to  a  processor;  and  P,  the  number  of  processors  on  a 
parallel  computer.  For  rectangular  matrices,  we  generally  use  m  for  the  number  of  rows  and 
n for the number of columns. We use the notation $\lg = \log_2$ and specify the base of other
logarithms,  except  when  using  asymptotic  notation. 

2.1.1  Asymptotic  Notation 

We use standard asymptotic notation throughout the thesis. In particular, we use $O(\cdot)$, $\Omega(\cdot)$,
and $\Theta(\cdot)$. Formally, $f(n) = O(g(n))$ implies that for sufficiently large $n$ there exists a positive
constant $c$ such that $f(n) \le c\,g(n)$; $f(n) = \Omega(g(n))$ implies that for sufficiently large $n$ there
exists a positive constant $c$ such that $f(n) \ge c\,g(n)$; and $f(n) = \Theta(g(n))$ implies that both
$f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$. Informally, we use the notation to hide constants which
are small relative to the values of $n$, $M$, and $P$ (and possibly other parameters) for reasonably
sized problems. We also use the notation $f(n) = \tilde{O}(g(n))$ to mean $f(n) = O(g(n) \log^k n)$
for some positive constant $k$, and $f(n) \ll g(n)$ and $g(n) \gg f(n)$ to mean that for every
positive constant $c$, $f(n) < c\,g(n)$ for sufficiently large $n$. Informally, $\tilde{O}(\cdot)$ means “ignoring
logarithmic factors” and $f(n) \ll g(n)$ means that $f$ is insignificant relative to $g$.

We  also  emphasize  that  our  asymptotic  notation  does  not  imply  that  the  results  require 
n  to  approach  infinity  to  be  correct.  While  the  asymptotic  analyses  provide  better  estimates 
for  larger  parameter  values,  they  have  proved  meaningful  and  useful  in  algorithmic  design 
for  modestly  sized  problems.  For  example,  problems  with  matrix  dimensions  in  the  100s 
and  processor  counts  in  the  10s  are  already  affected  by  communication  costs  and  can  benefit 
from  the  ideas  based  on  the  asymptotic  costs. 
2.1.2 Algorithmic Terminology

In this thesis, we will focus primarily on direct algorithms in linear algebra, where the
computation is determined by the structure of the input (e.g., the matrix dimension) and not
by the numerical values of the input. We specify direct algorithms in contrast to iterative
algorithms  (e.g.,  Krylov  subspace  methods),  where  the  computation  repeatedly  improves  its 
computed  solution  and  the  amount  of  arithmetic  depends  on  the  numerical  properties  of  the 
input. 

We  use  the  term  computation  to  refer  to  a  specified  set  of  inputs  and  outputs,  as  well  as 
the  mathematical  dependencies  among  them.  We  can  thus  represent  a  computation  with  a 
directed acyclic graph, or CDAG, consisting of nodes for input and output elements, as well
as  temporary  intermediate  values,  and  edges  representing  mathematical  dependencies. 

We  use  the  term  algorithm  to  refer  to  a  particular  scheduling  of  a  computation  on  a  given 
memory  model.  On  a  sequential  machine,  an  algorithm  specifies  the  order  of  evaluation  of 
the  scalar  operations;  on  a  parallel  machine,  an  algorithm  specifies  on  which  processor  an 
operation  is  performed  and  the  order  of  operations  for  each  processor.  A  correct  algorithm 
specifies  an  order  that  respects  the  dependencies  of  the  computation. 

For  the  purposes  of  this  thesis,  even  though  we  consider  only  unary  and  binary  scalar 
operations,  we  do  not  require  that  all  nodes  in  the  CDAG  have  in-degree  less  than  or  equal 
to  two.  In  particular,  the  output  of  a  summation  of  n  values  has  in-degree  n.  We  allow  the 
algorithm  (rather  than  the  computation)  to  exploit  associativity  and  specify  the  shape  of 
the  binary  tree  used  to  compute  the  output.1 
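
As a small illustration of how the choice of summation tree can matter, the sketch below evaluates the same four-term sum with two different tree shapes in double precision. The input values and function names are chosen here purely for illustration and do not come from the thesis.

```python
def sum_left_to_right(values):
    """Evaluate ((v0 + v1) + v2) + ... : a maximally unbalanced tree."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def sum_pairwise(values):
    """Evaluate a balanced binary tree by recursive halving."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]  # the exact sum is 2.0
    print("left-to-right:", sum_left_to_right(data))
    print("pairwise:     ", sum_pairwise(data))
```

Both functions are correct schedules of the same computation in the sense used here, yet on this input they return different floating point results, and neither equals the exact sum.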

We  will  consider  two  main  classes  of  computations/algorithms  (because  these  classes 
do not depend on communication, we do not differentiate between computations and algorithms).
The following definition is based on [56]:

Definition  2.1.  A  classical  algorithm  in  linear  algebra  is  one  that  uses  only  data  on  which 
an  output  matrix  entry  mathematically  depends 2  to  compute  its  value. 

For  example,  in  the  case  of  matrix  multiplication  AB  =  C,  a  classical  algorithm  uses 
only nonzero entries in row $i$ of $A$ and column $j$ of $B$ to compute $C_{ij}$. In general, a classical
algorithm applied to dense matrices performs $\Theta(n^3)$ flops. Note that a classical algorithm
designed for sparse matrices will avoid flops with zeros and may do far fewer operations than
$\Theta(n^3)$. We also use the term conventional as a synonym of classical. We define classical
algorithms  in  order  to  differentiate  them  from  fast  algorithms. 

Definition  2.2.  Given  a  dense  linear  algebra  problem  for  which  the  most  computationally 
efficient classical algorithm performs $\Theta(n^3)$ flops, a fast algorithm for that problem is one
that performs $O(n^{\omega_0})$ flops, where $\omega_0 < 3$.

1 Note that because floating point addition is not associative, this implies that two correct algorithms for
the  same  computation  may  compute  different  output  values. 

2 We  assume  generic  input,  ignoring  the  possibility  of  exact  cancellation. 
In particular, fast algorithms do not provide merely a different scheduling for a classical
computation;  they  are  based  on  a  different  computation  graph  that  computes  equivalent 
outputs  in  exact  arithmetic.  They  introduce  computational  dependencies  between  input  and 
output  entries  which  have  no  mathematical  dependence,  but  they  do  so  using  distributivity 
and  cancellation  in  a  way  that  decreases  the  total  amount  of  arithmetic  required.  The  most 
well-known  fast  algorithm  is  Strassen’s  algorithm  for  dense  matrix  multiplication  [139].  See 
Section  2.4  for  two  variants  of  his  algorithm. 
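
The difference between a classical and a fast computation can be seen in a minimal sketch. The code below uses the commonly cited seven-product formulation of Strassen's recursion, which may differ in detail from the variants given in Section 2.4, and assumes square matrices whose dimension is a power of two. It counts only scalar multiplications; all function names are illustrative.

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(X):
    """Return the four quadrants X11, X12, X21, X22."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def classical(A, B, count):
    """Three nested loops; count[0] accumulates scalar multiplications."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                count[0] += 1
    return C


def strassen(A, B, count):
    """Seven recursive products per level instead of eight."""
    if len(A) == 1:
        count[0] += 1
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22), count)
    M2 = strassen(add(A21, A22), B11, count)
    M3 = strassen(A11, sub(B12, B22), count)
    M4 = strassen(A22, sub(B21, B11), count)
    M5 = strassen(add(A11, A12), B22, count)
    M6 = strassen(sub(A21, A11), add(B11, B12), count)
    M7 = strassen(sub(A12, A22), add(B21, B22), count)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

For a 4-by-4 integer input, both routines return the same product, but the classical loop performs 64 scalar multiplications while the recursion performs 49. Each product such as M1 combines entries that no single output entry depends on mathematically, which is the sense in which the computation graph differs.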

2.1.3  Communication  Terminology 

In  order  to  estimate  the  running  time  of  an  algorithm,  we  consider  both  the  computation 
and  the  communication  it  performs.  Since  the  algorithms  considered  are  most  commonly 
used  for  matrices  with  real-  or  complex-valued  entries,  we  measure  computation  in  terms 
of  floating  point  operations  (flops).  We  measure  communication  in  terms  of  words  and 
messages,  where  a  word  is  one  floating  point  number  and  a  message  is  a  group  of  words 
communicated  simultaneously.  See  Section  2.2  for  a  more  precise  description  of  a  message 
in  each  of  the  memory  models. 

We  use  F,  W,  and  S  to  denote  the  number  of  flops,  words,  and  messages,  respectively. 
While these quantities are generally functions of parameters like n, M, and/or P, we omit the
arguments  when  the  context  is  clear.  We  also  use  the  terms  computational  cost,  bandwidth 
cost,  and  latency  cost  (defined  below)  to  refer  to  these  quantities. 

We assume a fixed cost $\gamma$ to perform a single flop, and we do not differentiate among scalar
additions, subtractions, multiplications, divisions, and square roots. For communication, we
assume the cost to communicate a message of $m$ words is $\alpha + \beta m$, where $\alpha$ is the fixed cost
for a message and $\beta$ is the fixed cost for each word. With these assumptions, we can model
the running time $T$ of an algorithm as a sum of three terms:

$$T = \gamma \cdot F + \beta \cdot W + \alpha \cdot S.$$

Note that this running time model ignores any overlap of computation and communication.
If we model overlap of computation with communication, the running time may
decrease by at most a factor of two, which will be inconsequential in our asymptotic analysis.
However, to obtain satisfactory performance on an actual implementation, overlapping
computation  and  communication  is  an  important  and  necessary  optimization. 
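
The running time model can be written down directly. The sketch below is a minimal illustration in which the `Machine` class, the function names, and the idealized "perfect overlap" variant are assumptions made here, not definitions from the thesis. The overlap variant takes the maximum of the computation and communication terms, which is why overlap can save at most a factor of two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    gamma: float  # time per flop
    beta: float   # time per word communicated
    alpha: float  # time per message


def running_time(machine, flops, words, messages):
    """T = gamma * F + beta * W + alpha * S, with no overlap."""
    return (machine.gamma * flops
            + machine.beta * words
            + machine.alpha * messages)


def running_time_with_overlap(machine, flops, words, messages):
    """Idealized perfect overlap (an assumption for illustration):
    computation and communication proceed fully concurrently."""
    compute = machine.gamma * flops
    communicate = machine.beta * words + machine.alpha * messages
    return max(compute, communicate)


if __name__ == "__main__":
    m = Machine(gamma=1.0, beta=2.0, alpha=3.0)  # made-up unit costs
    t = running_time(m, flops=10, words=20, messages=30)
    t_overlap = running_time_with_overlap(m, flops=10, words=20, messages=30)
    print(t, t_overlap, t / t_overlap)
```

Since the maximum of two nonnegative terms is at least half their sum, the ratio printed in the last column never exceeds two.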

Because  we  are  interested  in  running  time,  for  parallel  algorithms  we  will  consider  these 
costs  along  the  critical  path  of  the  algorithm,  the  longest  path  (measured  by  time)  in  the 
dependency  graph  of  the  algorithm’s  computation  and  communication  steps.  That  is,  if 
two  processors  perform  a  flop  simultaneously,  or  if  two  pairs  of  processors  communicate  a 
message  simultaneously,  the  cost  is  that  of  one  flop  or  one  message.3  For  example,  see  [155] 
for  a  discussion  of  critical  path  analysis,  where  the  dependency  graph  is  referred  to  as  the 
“program activity graph,” or PAG. In the PAG, each vertex corresponds to the beginning
or end of a computation step (performing some number of flops) or a communication step
(communicating a message of some number of words) of a particular processor, and the
weight of an edge between beginning and end vertices is the cost (in time) of the step. For
example, if a computation step consists of $f$ flops, the weight of the edge is $\gamma \cdot f$. Extra edges
of weight zero represent precedence relationships (e.g., a processor must wait to receive data
before it can perform computation with it). The PAG is distinct from the CDAG defined in
Section 2.1.2 because it depends on the algorithm used (not only the computation), but it
inherits some of its dependencies from the computational dependencies.

3 Note that this implies we ignore network contention; see Section 2.2.2.

Note  that,  in  general,  determining  the  critical  path  depends  on  the  relative  costs  of 
computation  and  communication.  For  example,  there  may  be  one  path  that  is  dominated 
by  computation  steps  and  another  dominated  by  communication  steps,  so  either  path  may 
be the critical one. In order to keep the algorithmic analysis separate from the machine-specific
costs of computation and communication, we expand the notion of one critical path
to  three  (possibly)  distinct  paths.  We  define  the  critical  path  with  respect  to  flops  as  the  path 
through  the  algorithm’s  dependency  graph  with  the  greatest  number  of  flops  performed,  and 
we  define  the  critical  path  with  respect  to  words  and  the  critical  path  with  respect  to  messages 
as  the  paths  through  the  dependency  graph  with  the  greatest  number  of  words  and  messages 
communicated,  respectively.  For  most  algorithms  considered  in  this  thesis,  these  three  paths 
are the same, independent of the relative values of $\alpha$, $\beta$, and $\gamma$.
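
The three critical paths can be illustrated on a toy dependency graph. The sketch below is a hypothetical example constructed for this purpose: the graph, its edge costs, and the function names are all assumptions. It computes the longest path through a small directed acyclic graph under different edge weightings.

```python
from functools import lru_cache

# Each edge is (start, end, cost), where cost counts the flops, words, and
# messages of one computation or communication step. The graph is made up:
# one path is communication-heavy, the other is pure computation.
EDGES = [
    ("start", "a", {"flops": 100, "words": 0, "messages": 0}),
    ("a", "end", {"flops": 0, "words": 10, "messages": 1}),
    ("start", "b", {"flops": 30, "words": 0, "messages": 0}),
    ("b", "end", {"flops": 200, "words": 0, "messages": 0}),
]


def critical_path_cost(edges, source, sink, weight):
    """Largest total weight over all source-to-sink paths in a DAG."""
    successors = {}
    for u, v, cost in edges:
        successors.setdefault(u, []).append((v, weight(cost)))

    @lru_cache(maxsize=None)
    def longest_from(u):
        if u == sink:
            return 0.0
        return max((w + longest_from(v) for v, w in successors.get(u, [])),
                   default=float("-inf"))

    return longest_from(source)


def time_weight(gamma, beta, alpha):
    """Edge weight in time units for given machine parameters."""
    return lambda c: (gamma * c["flops"] + beta * c["words"]
                      + alpha * c["messages"])


if __name__ == "__main__":
    for key in ("flops", "words", "messages"):
        cost = critical_path_cost(EDGES, "start", "end", lambda c: c[key])
        print(f"critical path with respect to {key}: {cost}")
    for params in ((1, 1, 1), (1, 10, 100)):
        t = critical_path_cost(EDGES, "start", "end", time_weight(*params))
        print(f"timed critical path for (gamma, beta, alpha) = {params}: {t}")
```

In this toy graph the critical path with respect to flops runs through vertex b, while the critical paths with respect to words and messages run through vertex a. Which of the two is the timed critical path depends on the machine parameters, which is the dependence the three separate definitions are meant to factor out.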

We  define  these  critical  paths  to  distinguish  our  communication  costs  from  another  metric 
of communication: communication volume, or the sum over all processors of the communication
performed. Because this metric is not as closely related to the running time of parallel
algorithms  on  most  networks,  we  do  not  consider  it  further. 

We  now  define  the  principal  metrics  by  which  we  will  evaluate  algorithms  in  terms  of 
computation  and  communication.  For  generality,  we  define  these  metrics  in  terms  of  critical 
paths;  sequential  algorithms  have  only  one  path. 

Definition  2.3.  The  computational  cost  of  an  algorithm  is  the  number  of  flops  performed 
along  the  critical  path  with  respect  to  flops. 

Definition  2.4.  The  bandwidth  cost  of  an  algorithm  is  the  number  of  words  communicated 
along  the  critical  path  with  respect  to  words  on  a  given  memory  model. 

Definition  2.5.  The  latency  cost  of  an  algorithm  is  the  number  of  messages  communicated 
along  the  critical  path  with  respect  to  messages  on  a  given  memory  model. 

Note  that  lower  bounds  on  the  computational,  bandwidth,  and  latency  costs  can  be 
established  by  proving  the  existence  of  a  single  path  with  a  given  number  of  flops,  words, 
or  messages.  We  will  often  use  a  single  processor’s  path,  the  edges  corresponding  to  that 
processor’s  computation  and  communication  steps  throughout  the  algorithm  execution,  to 
establish  these  lower  bounds,  thus  ignoring  all  inter-processor  dependencies. 

We  also  define  two  commonly  used  terms  to  describe  algorithms  with  low  communication 
costs. The first definition is informal, and we use the term more loosely. A communication-avoiding
algorithm is one that asymptotically reduces the bandwidth cost and/or latency cost
compared to the previous most communication-efficient algorithms for the same computation.
That  is,  we  use  the  adjective  “communication-avoiding”  to  denote  asymptotic  improvement, 
as opposed to improvements by constant factors that are independent of n, M, or P. We also
use the term to distinguish this approach from communication “hiding” by overlapping computation
and  communication.  Hiding  communication  is  an  important  practical  optimization,  but  it 
improves  run  time  only  by  a  constant  factor  and  so  is  ignored  in  our  theoretical  model. 

The  next  definition  is  more  precise: 

Definition 2.6. A communication-optimal algorithm is one whose bandwidth and latency
costs attain known lower bounds in an asymptotic sense⁴ for the corresponding computation.

Thus,  the  existence  of  communication-optimal  algorithms  indicates  matching  lower  bounds 
(given  by  a  proof)  and  upper  bounds  (given  by  an  algorithm). 

2.2  Memory  Models 

We  consider  two  main  types  of  architectures  in  this  thesis:  sequential  and  parallel  computers 
(see Figure 2.1). Our goals in specifying the two machine models are to (1) provide abstractions
that are simple enough to reason about theoretically but realistic enough to be useful
in practice and (2) provide two models that are distinct enough to capture the differences
between sequential and parallel computation but similar enough to share terminology and
fundamental reasoning. To this end, both models include the parameters $M$, $\alpha$, $\beta$, and $\gamma$ and
consider  communication  in  terms  of  bandwidth  and  latency  costs.  However,  the  terminology 
must  be  interpreted  slightly  differently  in  the  two  models.  For  example,  as  we  will  see  below, 
M  is  the  size  of  the  fast  memory  in  the  sequential  model,  and  it  is  the  size  of  the  local 
memory  in  the  parallel  model. 

2.2.1  Two-Level  Sequential  Memory  Model 

We  model  a  sequential  machine  with  two  levels  of  memory  hierarchy  (fast  and  slow)  and 
measure  the  communication  between  these  two  levels  during  the  execution  of  the  algorithm. 
See  Figure  2.1a  for  a  depiction  of  the  model.  We  use  M  to  denote  the  size  of  the  fast  memory 
in  words.  If  words  are  stored  contiguously  in  slow  memory,  then  they  can  be  read  or  written 
together  as  a  message.  Note  that  we  allow  messages  to  range  in  size  from  one  word  to  the 
size  of  the  fast  memory.  As  described  in  Section  2.1.3,  we  measure  the  number  of  words  and 
messages  separately. 

A  simple  generalization  of  this  model  is  to  consider  a  smaller  maximum  message  size 
L  <  M,  which  is  appropriate  for  modeling  cache  lines  in  cache  memories,  for  example.  We 
use  this  notation  in  the  statement  of  Theorem  2.11,  but  in  general  we  assume  L  =  M  because 
this  will  minimize  the  latency  lower  bound  for  any  possible  architecture. 

⁴ By this we mean up to a constant factor and, in some cases, up to a polylogarithmic factor in n, M, or P.

Figure 2.1: Main types of memory models used for communication cost analysis: (a) two-level sequential memory model; (b) distributed-memory parallel model.


2.2.1.1  Related  Models

This model is similar to the two-level I/O model, the disk access model, and the ideal-cache model (e.g., see
[5, 70, 88]), and the bandwidth and latency costs are closely related to the I/O complexity of
an algorithm. In some variations of these models, the I/O complexity refers to the number
of words transferred [88], and the contiguity of words in slow memory is ignored. In other
variations, I/O complexity refers to the number of messages, or block transfers, where the
transfer block size is limited to B < M [5]. Note that in this model, every message is of size
exactly B, as opposed to having only a maximum message size L. In this case, the costs of
moving one word and B contiguous words are the same. Note also that in this variation,
it  is  possible  to  communicate  multiple  blocks  simultaneously  to  model  the  parallelism  in 
the  memory  system  even  for  a  sequential  computational  unit.  The  ideal-cache  model  [70]  is 
similar,  with  B  representing  the  cache  line  size,  but  only  one  cache  line  can  be  communicated 
at  a  time. 

2.2.1.2  Hierarchical  Memory  Model

We  also  consider  a  generalization  of  the  two-level  memory  model  to  memory  hierarchies 
(see  [129],  for  example).  In  the  hierarchical  memory  model,  there  are  multiple  levels  of 
memory  with  monotonically  increasing  size  from  the  smallest  memory  where  computation  is 
performed  to  the  largest  memory  where  all  data  can  reside  simultaneously.  Data  may  move 
only  between  successive  levels  of  memory,  and  the  costs  of  data  movement  vary  among  pairs 
of successive levels. We consider this model particularly in the context of cache-oblivious
algorithms (see [70] and Chapter 7) that attain optimal communication costs between all
pairs of levels if they are communication-optimal in the two-level model. Note that cache-aware
algorithms can also minimize communication in this model, though they must be
tuned for each level of memory.

2.2.2  Distributed-Memory  Parallel  Model 

In the distributed-memory parallel case, we model the machine as a collection of P processors,
each with a limited local memory of size M, connected over a network. The local memory
size limit may be a result of the physical hardware or a restriction on the algorithm (e.g., the
algorithm  may  be  limited  to  using  only  a  constant  factor  more  memory  than  what  is  required 
to  store  the  input  and  output).  Processors  communicate  via  point-to-point  messages.  See 
Figure  2.1b  for  a  depiction  of  the  model.  Again,  we  are  interested  in  both  the  number  of 
words  (bandwidth  cost)  and  messages  (latency  cost),  and  we  count  these  costs  along  the 
critical  path(s)  of  the  algorithm.  That  is,  if  two  processors  each  send  a  message  to  separate 
processors  simultaneously,  the  cost  along  the  critical  path  is  that  of  one  message.  We  assume 
that  the  per-word  and  per-message  costs  include  the  time  it  takes  a  processor  to  pack  words 
into  a  contiguous  message  before  sending  it  over  the  network. 

We assume that (1) the architecture is homogeneous (that is, $\gamma$ is the same on all processors
and $\alpha$ and $\beta$ are the same between each pair of processors), (2) processors can
send/receive only one message to/from one processor at a time, and (3) there is no communication
resource contention among processors. That is, we assume that there is a link in
the  network  between  each  pair  of  processors.  Thus  lower  bounds  derived  in  this  model  are 
valid  for  any  network,  but  attainability  of  the  lower  bounds  depends  on  the  details  of  the 
network. 

2.2.2.1  Related  Models

We  note  that  there  are  many  related  parallel  models.  The  parallel  random  access  machine 
(PRAM) model [67] was designed to separate concerns of communication from parallel efficiency
given large numbers of processors and uses a shared-memory model. One of many
later variants, the LPRAM model [4], captures communication complexity but still incorporates
a global shared memory. The Bulk Synchronous Parallel (BSP) model [146] addresses
distributed-memory parallel machines and separates algorithmic time into synchronous computational
and communication steps. The “communication cost” in that model is equivalent
to the bandwidth cost in our model, and the “synchronization cost” is closely related to
our latency cost, though in the BSP model a processor can communicate with multiple processors
simultaneously. The LogP model [54] is also very similar to our model. Using the
notation in Section 2.1.3, $\alpha$ is roughly equivalent to the sum of the “latency” and “overhead”
parameters, and $\beta$ and the “gap” parameter are also roughly equivalent.

Figure 2.2: Main types of data layouts used for storing matrices in slow memory: (a) column-major; (b) block-contiguous; (c) recursive.


2.2.2.2  Shared-Memory  Parallel  Models

We  note  that  because  of  the  current  ubiquity  of  multicore  processors,  shared-memory  parallel 
models  are  also  important  to  consider  in  estimating  algorithmic  performance.  While  we  do 
not  explicitly  consider  shared-memory  models  in  this  thesis,  there  are  several  models  to 
which  it  is  possible  to  extend  both  our  lower  bound  analyses  and  algorithms  [46,  130,  145]. 
Another  trend  in  hardware  design  is  greater  heterogeneity  among  processing  units.  See  [18] 
for  a  shared-memory  heterogeneous  parallel  model  with  an  extension  of  the  lower  bounds 
and  new  algorithms  for  matrix  multiplication. 


2.3  Data  Layouts 

2.3.1  Matrix  Layouts  in  Slow  Memory 

We  consider  three  main  types  of  matrix  data  layouts  in  slow  memory:  column  major,  block 
contiguous,  and  recursive.  There  are  simple  variations  of  these  layouts,  like  row-major 
instead  of  column-major  ordering,  that  we  do  not  discuss.  Adapting  the  analysis  to  these 
variations  is  straightforward. 

The  column-major  data  layout  stores  the  matrix  entries  in  column-wise  order,  as  shown 
in  Figure  2.2a.  That  is,  each  column  is  stored  contiguously,  ordering  column  entries  from 
top  to  bottom,  and  columns  are  ordered  from  left  to  right.  This  is  the  most  commonly  used 
layout  (e.g.,  in  Fortran)  because  the  indexing  into  a  linear  array  is  simple,  and  column  major 
is  the  data  layout  of  choice  in  LAPACK  [8]. 

The block-contiguous data layout involves a block size parameter $b$ and stores the matrix
so that $b \times b$ blocks are contiguous, as shown in Figure 2.2b. The entries
within a block may be stored in any layout, and the $(n/b)^2$ blocks may be ordered using
any (possibly different) layout. We generally assume column-major orderings both inside
and among blocks. Alternatively, the elements of each block may be stored as contiguous
sub-blocks, where each sub-block is of size $b' < b$. The data structure may include several
such layers of sub-blocks. The contiguous blocks need not be square, but the rectangular
generalization will not be useful in this thesis.

Figure 2.3: Block-cyclic matrix distribution. Each of the 16 submatrices shown on the left
has exactly the same distribution. The colored blocks are the ones owned by processor 00.
On the right is a zoomed-in view of one submatrix, showing which processor, numbered with
row and column indices, owns each block.

The recursive layout [70, 154] is also known as the bit-interleaved layout, space-filling
curve storage, or Morton ordering format. This layout stores each of the four $(n/2) \times (n/2)$
submatrices of a square matrix contiguously, and then the elements of each submatrix are
ordered so that the smaller submatrices are each stored contiguously, and so on recursively,
as shown in Figure 2.2c. The recursive layout can be extended to rectangular matrices in
various ways; see [32] for an example.
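
To make the three layouts concrete, the following sketch maps a matrix entry $(i, j)$ (zero-based) to its offset in a linear array under each layout. The function names are hypothetical, the block-contiguous version assumes column-major ordering inside and among blocks with $b$ dividing $n$, and the recursive version assumes $n$ is a power of two and one particular quadrant order (top-left, top-right, bottom-left, bottom-right); other orders are equally valid.

```python
def column_major_index(i, j, n):
    """Offset of entry (i, j) of an n x n matrix stored column by column."""
    return j * n + i


def block_contiguous_index(i, j, n, b):
    """Offset when b x b blocks are contiguous (assumes b divides n)."""
    blocks_per_column = n // b
    block_row, block_col = i // b, j // b
    block_id = block_col * blocks_per_column + block_row  # column-major among blocks
    within = (j % b) * b + (i % b)                         # column-major inside a block
    return block_id * b * b + within


def recursive_index(i, j, n):
    """Offset in a recursive (Morton) layout, for n a power of two.

    Interleaves the bits of i and j; the row bit is the more significant
    bit of each pair, which yields the assumed quadrant order.
    """
    index = 0
    bit = 0
    while (1 << bit) < n:
        index |= ((j >> bit) & 1) << (2 * bit)       # column bit
        index |= ((i >> bit) & 1) << (2 * bit + 1)   # row bit
        bit += 1
    return index
```

For a 4 x 4 matrix, each function assigns the offsets 0 through 15 exactly once; only the order in which entries appear differs.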

We note that for symmetric matrices, only half the matrix needs to be accessed (or stored).
See  [24,  Section  3.1.1]  for  a  summary  of  adaptations  of  each  of  these  data  structures  to 
symmetric  matrices. 

2.3.2  Matrix  Distributions  on  Parallel  Machines 

In  the  parallel  case,  we  consider  one  general  scheme  for  distributing  matrices  across  the 
local  memories  of  the  processors:  the  block-cyclic  distribution,  shown  in  Figure  2.3.  The 
block-cyclic distribution involves two block size parameters, $b_r$ and $b_c$, and divides the matrix
into $b_r \times b_c$ blocks. For a rectangular $m \times n$ matrix, these $(m/b_r) \cdot (n/b_c)$ blocks are then
distributed to processors in a round-robin fashion. We generally assume a two-dimensional
logical processor grid, so that all blocks in a given block row are
distributed among a single processor row (and similarly for block columns).

The block-cyclic distribution is general enough to encompass many important distributions,
and it is the method of choice of ScaLAPACK [44]. For example, choosing $b_r = b_c = 1$
gives the cyclic layout (used in Elemental [121]), and choosing $b_r = m/P_r$ and $b_c = n/P_c$ with
a $P_r \times P_c$ processor grid gives the block layout. Column-cyclic, row-cyclic, block-column,
and block-row distributions can all be achieved with appropriate choices of $b_r$ and $b_c$.
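
The ownership rule implied by a two-dimensional block-cyclic distribution can be sketched in a few lines. The function name is hypothetical, indices are zero-based, and the processor grid is assumed to be $P_r \times P_c$ with blocks dealt out round-robin in each dimension.

```python
def block_cyclic_owner(i, j, br, bc, Pr, Pc):
    """Grid coordinates of the processor owning entry (i, j).

    br, bc: block sizes in the row and column dimensions.
    Pr, Pc: dimensions of the logical processor grid.
    """
    return ((i // br) % Pr, (j // bc) % Pc)


# An 8 x 8 matrix on a 2 x 2 grid under three parameter choices.
cyclic = block_cyclic_owner(5, 2, br=1, bc=1, Pr=2, Pc=2)        # cyclic layout
blocked = block_cyclic_owner(5, 2, br=4, bc=4, Pr=2, Pc=2)       # block layout
block_cyclic = block_cyclic_owner(5, 2, br=2, bc=2, Pr=2, Pc=2)  # general case
```

The same entry can land on different processors depending on the block sizes, which is why the choice of $b_r$ and $b_c$ matters for load balance and communication.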

As mentioned in Section 2.2.2, we assume the communication parameters $\alpha$ and $\beta$ include
the  costs  incurred  by  a  processor  for  locally  packing  words  into  messages  before  sending  them 
to  another  processor.  Thus,  we  do  not  specify  the  layout  each  processor  uses  to  store  data 
in  its  local  memory. 


2.4  Fast  Matrix  Multiplication  Algorithms 

In this section we give two fast algorithms for matrix multiplication. They each have computational
cost of $O(n^{\lg 7})$, where $\lg 7 \approx 2.81$, so they require fewer flops than a classical
algorithm. We note that we present them as recursive algorithms, but they specify computations
more generally, as the recursion tree can be traversed in many different orderings. See
Section 2.1.2 for a differentiation between algorithms and computations.

2.4.1  Strassen’s  Algorithm 

Here  we  give  Strassen’s  algorithm  for  matrix  multiplication  in  its  original  notation  [139]. 
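First, divide the input matrices $A$, $B$ and output matrix $C$ into 4 submatrices:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \quad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$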

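Then, for one step of the algorithm, compute the following quantities:

$$\begin{aligned}
\mathrm{I} &= (A_{11} + A_{22}) \cdot (B_{11} + B_{22}) \\
\mathrm{II} &= (A_{21} + A_{22}) \cdot B_{11} \\
\mathrm{III} &= A_{11} \cdot (B_{12} - B_{22}) \\
\mathrm{IV} &= A_{22} \cdot (B_{21} - B_{11}) \\
\mathrm{V} &= (A_{11} + A_{12}) \cdot B_{22} \\
\mathrm{VI} &= (A_{21} - A_{11}) \cdot (B_{11} + B_{12}) \\
\mathrm{VII} &= (A_{12} - A_{22}) \cdot (B_{21} + B_{22}) \\
C_{11} &= \mathrm{I} + \mathrm{IV} - \mathrm{V} + \mathrm{VII} \\
C_{12} &= \mathrm{III} + \mathrm{V} \\
C_{21} &= \mathrm{II} + \mathrm{IV} \\
C_{22} &= \mathrm{I} + \mathrm{III} - \mathrm{II} + \mathrm{VI}
\end{aligned}$$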

The algorithm is recursive since it can be used for each of the 7 smaller matrix multiplications.
Each step also performs 18 additions of $(n/2) \times (n/2)$ matrices, so the recursion for the
computational cost of the algorithm is $F(n) = 7F(n/2) + 18(n/2)^2$, yielding a solution of

$$F(n) = 7n^{\lg 7} - 6n^2$$

for $n$ a power of two and using a base case of $n = 1$.
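
The recursion can be sketched in plain Python on lists of lists. This is a minimal illustration rather than a tuned implementation: the helper names are hypothetical, the matrices are assumed square with dimension a power of two, and the recursion runs all the way down to $1 \times 1$ blocks. The classical product is included only as a reference for checking the result.

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(X):
    """Return the four quadrants X11, X12, X21, X22."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def strassen(A, B):
    """Multiply square matrices whose dimension is a power of two."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products.
    I = strassen(add(A11, A22), add(B11, B22))
    II = strassen(add(A21, A22), B11)
    III = strassen(A11, sub(B12, B22))
    IV = strassen(A22, sub(B21, B11))
    V = strassen(add(A11, A12), B22)
    VI = strassen(sub(A21, A11), add(B11, B12))
    VII = strassen(sub(A12, A22), add(B21, B22))
    # Combine the products into the quadrants of C.
    C11 = add(sub(add(I, IV), V), VII)
    C12 = add(III, V)
    C21 = add(II, IV)
    C22 = add(sub(add(I, III), II), VI)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom


def classical(A, B):
    """Reference implementation of the classical algorithm."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

With integer entries the two routines agree exactly, which makes a direct comparison a convenient sanity check.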
2.4.2  Strassen-Winograd  Algorithm

The Strassen-Winograd algorithm is usually preferred to Strassen’s algorithm in practice
since it requires fewer additions (15 instead of 18). As before, we divide the matrices into
quadrants. Then we form 7 linear combinations of the submatrices of each of $A$ and $B$, called
$T_i$ and $S_i$, respectively; multiply them pairwise; and then form the submatrices of $C$ as
linear combinations of these products:


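$$\begin{array}{llll}
T_0 = A_{11} & S_0 = B_{11} & Q_0 = T_0 \cdot S_0 & U_1 = Q_0 + Q_3 \\
T_1 = A_{12} & S_1 = B_{21} & Q_1 = T_1 \cdot S_1 & U_2 = U_1 + Q_4 \\
T_2 = A_{21} + A_{22} & S_2 = B_{12} - B_{11} & Q_2 = T_2 \cdot S_2 & U_3 = U_1 + Q_2 \\
T_3 = T_2 - A_{11} & S_3 = B_{22} - S_2 & Q_3 = T_3 \cdot S_3 & C_{11} = Q_0 + Q_1 \\
T_4 = A_{11} - A_{21} & S_4 = B_{22} - B_{12} & Q_4 = T_4 \cdot S_4 & C_{12} = U_3 + Q_5 \\
T_5 = A_{12} - T_3 & S_5 = B_{22} & Q_5 = T_5 \cdot S_5 & C_{21} = U_2 - Q_6 \\
T_6 = A_{22} & S_6 = S_3 - B_{21} & Q_6 = T_6 \cdot S_6 & C_{22} = U_2 + Q_2
\end{array}$$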

This is one step of Strassen-Winograd. The variation is often erroneously attributed to
[152] and actually appears in [153]. In practice, one often uses only a few steps of Strassen-Winograd,
although to attain $O(n^{\lg 7})$ computational cost, it is necessary to recursively
apply it all the way down to matrices of size $O(1) \times O(1)$. The precise computational cost
of Strassen-Winograd (for $n$ a power of 2) is

$$F(n) = c_s n^{\lg 7} - 5n^2.$$

Here $c_s$ is a constant depending on the cutoff point at which one switches to the classical
algorithm. For a cutoff size of $n_0$, the constant is $c_s = (2n_0 + 4)/n_0^{\lg 7 - 2}$, which is minimized
at $n_0 = 8$, yielding a computational cost of approximately $3.73\,n^{\lg 7} - 5n^2$.
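
The dependence of the flop count on the cutoff can be checked numerically. The sketch below assumes the recurrence $F(n) = 7F(n/2) + 15(n/2)^2$ above the cutoff and the classical count $2n^3 - n^2$ at or below it, with $n$ and $n_0$ powers of two; the function names are hypothetical.

```python
from math import log2


def strassen_winograd_flops(n, n0):
    """Flops for Strassen-Winograd on n x n matrices with cutoff size n0."""
    if n <= n0:
        # Classical algorithm: n^3 multiplications and n^2 (n - 1) additions.
        return 2 * n**3 - n**2
    half = n // 2
    # Seven recursive products plus 15 additions of (n/2) x (n/2) matrices.
    return 7 * strassen_winograd_flops(half, n0) + 15 * half**2


def leading_constant(n0):
    """c_s = (2 n0 + 4) / n0^(lg 7 - 2)."""
    return (2 * n0 + 4) / n0 ** (log2(7) - 2)


cutoffs = [1, 2, 4, 8, 16, 32, 64]
best_cutoff = min(cutoffs, key=leading_constant)
```

Among power-of-two cutoffs, the constant decreases from 6 at $n_0 = 1$ to its minimum at $n_0 = 8$ and grows again for larger cutoffs.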


2.5  Lower  Bound  Lemmas 

Chapters  3-6  establish  several  communication  lower  bounds  for  a  variety  of  computations. 
In  this  section,  we  state  fundamental  definitions  and  lemmas  that  will  be  useful  throughout 
the  subsequent  chapters. 

2.5.1  Loomis-Whitney  Inequality 

Loomis  and  Whitney  [107]  proved  a  geometrical  result  that  provides  a  surface-to-volume 
relationship in general high-dimensional space. We need only the simplest version of their
result,  which  will  prove  instrumental  in  proving  lower  bounds  for  classical  computations  in 
Chapters  4  and  6. 

Lemma 2.7 ([107]). Let $V$ be a finite set of lattice points in $\mathbb{R}^3$, i.e., points $(x, y, z)$ with
integer coordinates. Let $V_x$ be the projection of $V$ in the $x$-direction, i.e., all points $(y, z)$
such that there exists an $x$ so that $(x, y, z) \in V$. Define $V_y$ and $V_z$ similarly. Let $|\cdot|$ denote
the cardinality of a set. Then $|V| \le \sqrt{|V_x| \times |V_y| \times |V_z|}$.

An intuition for the correctness of this lemma is as follows: think of a box of dimensions
$a \times b \times c$. Then its (rectangular) projections on the three planes have areas $a \cdot b$, $b \cdot c$, and
$a \cdot c$, and its volume $a \cdot b \cdot c$ is equal to the square root of the product of the
three areas. In this instance equality is achieved; only the inequality applies in general.
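
The inequality is easy to test on small point sets. The sketch below uses hypothetical helper names, computes the three projections of a set of integer points, and compares $|V|^2$ with $|V_x| \cdot |V_y| \cdot |V_z|$ to avoid taking a square root.

```python
import random


def projections(V):
    """Project a set of points (x, y, z) along each coordinate direction."""
    Vx = {(y, z) for (x, y, z) in V}
    Vy = {(x, z) for (x, y, z) in V}
    Vz = {(x, y) for (x, y, z) in V}
    return Vx, Vy, Vz


def satisfies_loomis_whitney(V):
    """Check |V|^2 <= |Vx| * |Vy| * |Vz| using integer arithmetic."""
    Vx, Vy, Vz = projections(V)
    return len(V) ** 2 <= len(Vx) * len(Vy) * len(Vz)


# A 2 x 3 x 4 box attains equality; a random point cloud generally does not.
box = {(x, y, z) for x in range(2) for y in range(3) for z in range(4)}
random.seed(0)
cloud = {tuple(random.randrange(5) for _ in range(3)) for _ in range(40)}
```

For the box, the projections have sizes 12, 8, and 6, whose product is $576 = 24^2$.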


2.5.2  Expansion  Preliminaries 

In  this  section  we  define  the  edge  expansion  of  a  graph  and  generalize  it  slightly  for  our 
purposes.  These  preliminaries  will  be  used  in  Chapters  5  and  6. 

We use common graph notation here, with $V = V(G)$ denoting the set of vertices and
$E = E(G)$ the set of edges of a graph $G = (V, E)$. We also use the symbol $\setminus$ to denote set
subtraction. Recall that a $d$-regular graph is one whose vertices all have degree $d$.

Definition 2.8. The edge expansion $h(G)$ of a $d$-regular undirected graph $G = (V, E)$ is

$$h(G) = \min_{U \subseteq V,\; |U| \le |V|/2} \frac{|E(U, V \setminus U)|}{d \cdot |U|}, \tag{2.1}$$

where $E(A, B)$ is the set of edges connecting the vertex sets $A$ and $B$.

If a graph $G = (V, E)$ is not regular but has a bounded maximal degree $d$, then we
can add ($< d$) loops to vertices of degree $< d$, obtaining a regular graph $G'$. We use the
convention that a loop adds 1 to the degree of a vertex. Note that for any $S \subseteq V$, we have
$|E_G(S, V \setminus S)| = |E_{G'}(S, V \setminus S)|$, as none of the added loops contributes to the edge expansion
of $G'$.

For many graphs, small sets expand better than larger sets. Let $h_s(G)$ denote the edge
expansion for sets of size at most $s$ in $G$:

$$h_s(G) = \min_{U \subseteq V,\; |U| \le s} \frac{|E_G(U, V \setminus U)|}{d \cdot |U|}. \tag{2.2}$$
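
For very small graphs, both quantities can be evaluated by brute force, which helps build intuition for the definitions. The sketch below uses hypothetical function names, enumerates every nonempty vertex subset up to the size limit, and returns the minimum ratio of cut edges to $d \cdot |U|$ as an exact fraction. It is exponential in the number of vertices and intended only as an illustration.

```python
from fractions import Fraction
from itertools import combinations


def small_set_expansion(vertices, edges, d, s):
    """Minimum of |E(U, V minus U)| / (d |U|) over nonempty U with |U| <= s."""
    best = None
    for size in range(1, s + 1):
        for subset in combinations(vertices, size):
            U = set(subset)
            # An edge is cut if exactly one endpoint lies in U.
            cut = sum(1 for (a, b) in edges if (a in U) != (b in U))
            ratio = Fraction(cut, d * size)
            if best is None or ratio < best:
                best = ratio
    return best


def edge_expansion(vertices, edges, d):
    """Edge expansion h(G): the small-set expansion with s = |V| / 2."""
    return small_set_expansion(vertices, edges, d, len(vertices) // 2)


# The 6-cycle is 2-regular.
cycle_vertices = list(range(6))
cycle_edges = [(v, (v + 1) % 6) for v in cycle_vertices]
```

On the 6-cycle, a single vertex has ratio 1, two adjacent vertices have ratio 1/2, and three consecutive vertices have ratio 1/3, which illustrates how small sets can expand better than larger ones.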


For many interesting graph families, $h_s(G)$ does not depend on $|V(G)|$ when $s$ is fixed,
although it may decrease when $s$ increases. One way of bounding $h_s(G)$ is by decomposing
$G$ into small subgraphs of large edge expansion. Let us first define graph decomposition:

Definition 2.9. We say that the set of graphs $\{G'_i = (V_i, E_i)\}_{1 \le i \le l}$ is an edge-disjoint
decomposition of $G = (V, E)$ if $V = \bigcup_i V_i$ and $E = \biguplus_i E_i$.

Now, we state and prove a lemma relating the small-set edge expansion of a graph to
the edge expansion of its component graphs.

Lemma 2.10. Let $G = (V, E)$ be a $d$-regular graph that can be decomposed into edge-disjoint
(but not necessarily vertex-disjoint) copies of a graph $G' = (V', E')$ with maximum degree $d'$.
Then the edge expansion of $G$ for sets of size at most $|V'|/2$ is at least $h(G') \cdot \frac{d'}{d}$, namely

$$h_{|V'|/2}(G) = \min_{U \subseteq V,\; |U| \le |V'|/2} \frac{|E_G(U, V \setminus U)|}{d \cdot |U|} \ge h(G') \cdot \frac{d'}{d}.$$

Proof. Let $U \subseteq V$ be of size $|U| \le |V'|/2$. Let $\{G'_i = (V_i, E_i)\}_{1 \le i \le l}$ be an edge-disjoint
decomposition of $G$, where every $G'_i$ is isomorphic to $G'$. Let $U_i = V_i \cap U$. Then

$$|E_G(U, V \setminus U)| = \sum_{i=1}^{l} |E_{G'_i}(U_i, V_i \setminus U_i)| \ge \sum_{i=1}^{l} h(G'_i) \cdot d' \cdot |U_i| = h(G') \cdot d' \cdot \sum_{i=1}^{l} |U_i| \ge h(G') \cdot d' \cdot |U|.$$

Therefore $\frac{|E_G(U, V \setminus U)|}{d \cdot |U|} \ge h(G') \cdot \frac{d'}{d}$. □


2.5.3  Latency  Lower  Bounds 

In  the  subsequent  chapters,  we  state  all  results  in  terms  of  the  number  of  words  moved. 
Corresponding  bounds  on  the  number  of  messages  moved  exist  with  a  very  simple  proof: 
divide  the  bandwidth  cost  lower  bound  by  the  largest  possible  message  size.  Thus,  we  have 
the  following  theorem: 

Theorem 2.11. Suppose a computation has a bandwidth cost lower bound of $W \ge W'$. If
$L$ is the largest message size, then its latency cost lower bound is $S \ge W'/L$.

Because  it  applies  so  generally,  we  do  not  restate  the  latency  cost  lower  bound  for  every 
bandwidth  cost  lower  bound  result. 
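
The one-line argument behind the theorem can be written out as follows, where $W$ and $S$ denote the bandwidth and latency costs of any algorithm for the computation. The intermediate step assumes only that each message carries at most $L$ words, so the words moved along any path number at most $L$ times the messages on that path.

$$W \le L \cdot S \quad \text{and} \quad W \ge W' \quad \Longrightarrow \quad S \ge \frac{W}{L} \ge \frac{W'}{L}.$$

In the two-level sequential model with $L = M$, for example, a bandwidth lower bound of $W'$ words yields a latency lower bound of $W'/M$ messages.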


2.6  Numerical  Stability  Lemmas 

In this section we define our model of floating point computation and state several well-known
lemmas regarding the component-wise backward stability of fundamental linear algebra computations.
Nearly all of the results in this section can be found in [85].

Our model of floating point arithmetic is

$$\mathrm{fl}(x \;\mathrm{op}\; y) = (x \;\mathrm{op}\; y)(1 + \delta), \quad |\delta| \le u, \quad \mathrm{op} = +, -, \times, \div, \tag{2.3}$$

where $u$ is the unit roundoff [85, Section 2.2]. In other words, we ignore underflow and overflow.
We also assume that 0.5 is a floating point number and that

$$\mathrm{fl}(0.5x) = 0.5x. \tag{2.4}$$
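
Both assumptions can be checked empirically for ordinary double-precision arithmetic. The sketch below assumes Python floats are IEEE double precision with round-to-nearest, so that $u = 2^{-53}$; it uses exact rational arithmetic from the standard library to measure the relative error $\delta$ of each operation on operands far from the underflow and overflow thresholds. The function name is hypothetical.

```python
from fractions import Fraction
import operator
import random

# Assumed unit roundoff for IEEE double precision with round-to-nearest.
UNIT_ROUNDOFF = Fraction(1, 2**53)


def relative_error(op, x, y):
    """Return |delta| such that fl(x op y) = (x op y)(1 + delta)."""
    exact = op(Fraction(x), Fraction(y))   # exact rational result
    computed = Fraction(op(x, y))          # rounded floating point result
    if exact == 0:
        return Fraction(0)
    return abs(computed - exact) / abs(exact)


random.seed(1)
ops = [operator.add, operator.sub, operator.mul, operator.truediv]
worst = max(relative_error(op, random.uniform(1, 2), random.uniform(1, 2))
            for op in ops for _ in range(1000))

# Scaling by 0.5 is exact in the absence of underflow, as in (2.4).
x = 0.1
halving_is_exact = Fraction(0.5 * x) == Fraction(x) / 2
```

The largest observed relative error stays at or below the assumed unit roundoff, consistent with (2.3).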
We begin by citing a few lemmas of floating point error analysis, all of which are either
well known or can be easily derived from well-known results. We use the notation
$\gamma_n = nu/(1 - nu)$ for any positive $n$. The following lemma provides a rule for manipulating
expressions involving $\gamma_n$ or quantities bounded by it.

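Lemma 2.12 ([85, Lemma 3.3]). The bound

$$\gamma_m + \gamma_n + \gamma_m \gamma_n \le \gamma_{m+n}$$

holds. Furthermore, if $\theta_m$ and $\theta_n$ are such that $|\theta_m| \le \gamma_m$ and $|\theta_n| \le \gamma_n$, then

$$(1 + \theta_m)(1 + \theta_n) = 1 + \theta_{m+n}, \quad |\theta_{m+n}| \le \gamma_{m+n}.$$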

The  following  lemma  provides  a  bound  on  the  accuracy  of  matrix-matrix  products.  In  our 
analysis  we  assume  that  matrices  are  multiplied  using  the  conventional  method,  as  opposed 
to  Strassen’s  algorithm  or  any  related  scheme. 

Lemma 2.13 ([85, Section 3.5]). Let $A$ and $B$ be $m \times p$ and $p \times n$ matrices respectively. If
the product $X = AB$ is formed in floating point arithmetic, then

$$X = AB + \Delta, \quad |\Delta| \le \gamma_p |A|\,|B|.$$
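
The component-wise bound can be verified on a small example by comparing a floating point product against an exact rational one. The sketch below assumes IEEE double precision with $u = 2^{-53}$ and the conventional triple-loop method; the function names are hypothetical, and the accumulation uses an explicit loop so that every addition is an ordinary rounded operation.

```python
from fractions import Fraction
import random


def gamma(n, u=Fraction(1, 2**53)):
    """gamma_n = n u / (1 - n u), with u the assumed unit roundoff."""
    return n * u / (1 - n * u)


def float_matmul(A, B):
    """Conventional matrix product in floating point arithmetic."""
    m, p, n = len(A), len(B), len(B[0])
    X = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for k in range(p):
                acc += A[i][k] * B[k][j]
            X[i][j] = acc
    return X


def bound_holds(A, B):
    """Check |X - AB| <= gamma_p |A| |B| entry by entry, using exact rationals."""
    p = len(B)
    X = float_matmul(A, B)
    for i in range(len(A)):
        for j in range(len(B[0])):
            terms = [Fraction(A[i][k]) * Fraction(B[k][j]) for k in range(p)]
            error = abs(Fraction(X[i][j]) - sum(terms))
            if error > gamma(p) * sum(abs(t) for t in terms):
                return False
    return True
```

In practice the observed error is usually far below the bound, which is a worst-case guarantee.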

The  following  two  lemmas  also  deal  with  matrix-matrix  multiplication.  Their  proofs 
are  similar  to  the  proof  of  Lemma  8.4  in  [85],  with  the  only  difference  stemming  from  the 
possible  scaling  by  0.5  in  Lemma  2.14.  The  assumption  (2.4)  in  our  model  guarantees  that 
this  scaling  has  no  effect  on  the  ultimate  bound. 

Lemma 2.14. Let $A$, $B$, and $C$ be matrices of dimensions $m \times p$, $p \times n$, and $m \times n$ respectively,
and let $\alpha$ be one of the scalars 0.5 and 1. If the matrix $X = C - \alpha AB$ is formed in floating
point arithmetic, then

$$C = \alpha AB + X + \Delta, \quad |\Delta| \le \gamma_p \left(\alpha |A|\,|B| + |X|\right).$$

Lemma 2.15. Let $A$, $B$, and $C$ be matrices of dimensions $m \times p$, $p \times m$, and $m \times m$ respectively.
If the matrix $X = C - AB - (AB)^T$ is formed in floating point arithmetic, then

$$C = AB + (AB)^T + X + \Delta, \quad |\Delta| \le \gamma_{2p} \left(|A|\,|B| + (|A|\,|B|)^T + |X|\right).$$

Finally,  the  following  two  lemmas  provide  bounds  on  the  accuracy  of  triangular  solves 
and  of  the  LU  factorization. 

Lemma 2.16 ([85, Section 8.1]). Let $T$ and $B$ be matrices of dimensions $m \times m$ and $m \times n$
respectively, and assume that $T$ is triangular. If the $m \times n$ matrix $X$ is computed by solving
the system $TX = B$ using substitution in floating point arithmetic, then

$$TX = B + \Delta, \quad |\Delta| \le \gamma_m |T|\,|X|.$$

Furthermore, if the system being solved is $XT = B$ and the dimensions of $X$ and $B$ are
$n \times m$, then

$$XT = B + \Delta, \quad |\Delta| \le \gamma_m |X|\,|T|.$$

Lemma 2.17 ([85, Section 9.3]). Let $A$ be an $m \times n$ matrix and let $r = \min\{m, n\}$. If $L$
and $U$ are the LU factors of $A$, computed in floating point arithmetic, then

$$A = LU + \Delta, \quad |\Delta| \le \gamma_r |L|\,|U|.$$

Part I

Communication Lower Bounds

Chapter 3

Communication Lower Bounds via Reductions


We  establish  communication  lower  bounds  for  three  fundamental  classical  computations  in 
this  chapter:  matrix  multiplication,  LU  decomposition,  and  Cholesky  decomposition.  In 
Chapters  4-6,  we  extend  these  lower  bound  results  in  various  ways.  The  main  contributions 
of  this  chapter  are 

•  a  summary  of  known  communication  lower  bounds  for  classical  matrix  multiplication, 

•  a  reduction  proof  to  extend  the  bounds  to  classical  LU  decomposition,  and 

•  a  reduction  proof  to  extend  the  bounds  to  classical  Cholesky  decomposition. 

That  is,  we  show  how  to  perform  matrix  multiplication  using  a  black-box  call  to  an  LU  or 
Cholesky  factorization  routine.  Thus,  we  prove  that  a  lower  bound  that  applies  to  matrix 
multiplication  also  applies  to  these  decompositions  (under  the  same  assumptions). 

The  content  of  this  chapter  appears  in  both  [23]  (conference  version)  and  [24]  (journal 
version), written with coauthors James Demmel, Olga Holtz, and Oded Schwartz. The
journal  version  of  the  paper  includes  a  more  detailed  discussion  of  the  algorithms,  but  the 
lower  bound  argument  for  Cholesky,  presented  in  Section  3.2.2,  appears  in  both  papers.  The 
simpler  argument  for  LU,  presented  in  Section  3.2.1,  also  appears  in  [80]. 


3.1  Classical  Matrix  Multiplication 

In  1981  Hong  and  Kung  [88]  proved  a  lower  bound  on  the  bandwidth  cost  required  to 
perform  dense  matrix  multiplication  in  the  sequential  two-level  memory  model  using  the 
classical  algorithm,  where  the  input  matrices  are  too  large  to  fit  in  the  fast  memory.  They 
obtained  the  following  result  using  what  they  called  a  “red-blue  pebble  game”  analysis  of 
the computation graph of the algorithm. For algorithms attaining this bound, see Section
7.1.1.

Theorem 3.1 ([88, Corollary 6.2]). For classical matrix multiplication of dense $m \times k$ and
$k \times n$ matrices implemented on a machine with fast memory of size $M$, the number of words
transferred between fast and slow memory is

$$W = \Omega\!\left(\frac{mkn}{\sqrt{M}}\right).$$

This  result  was  proven  using  a  different  technique  by  Irony,  Toledo,  and  Tiskin  [95] 
and  generalized  to  the  distributed-memory  parallel  case.  They  state  the  following  parallel 
bandwidth cost lower bound using an argument based on the Loomis-Whitney inequality
[107],  given  as  Lemma  2.7. 

Theorem 3.2 ([95, Theorem 3.1]). For classical matrix multiplication of dense $m \times k$ and
$k \times n$ matrices implemented on a distributed-memory machine with $P$ processors each with
a local memory of size $M$, the number of words communicated by at least one processor is

$$W = \Omega\!\left(\frac{mkn}{P\sqrt{M}} - M\right).$$

In the case where $m = k = n$ and each processor stores the minimal $M = O(n^2/P)$ words
of data, the lower bound on bandwidth cost becomes $\Omega(n^2/P^{1/2})$. The authors also consider
the case where the local memory size is much larger, $M = O(n^2/P^{2/3})$, in which case $O(P^{1/3})$
times as much memory is used (compared to the minimum possible) and less communication
is necessary. In this case the bandwidth cost lower bound becomes $\Omega(n^2/P^{2/3})$. See Sections
8.1.1 and 8.2.1 for discussions of algorithms that attain this bound.

3.2  Reduction  Arguments 

3.2.1  LU  Decomposition 

Given  a  lower  bound  for  one  algorithm,  we  can  make  a  reduction  argument  to  extend  that 
bound  to  another  algorithm.  In  our  case,  given  the  matrix  multiplication  bounds  from 
Section  3.1,  if  we  can  show  how  to  perform  matrix  multiplication  using  another  algorithm 
(assuming  the  transformation  requires  no  extra  communication  in  an  asymptotic  sense),  then 
the  same  bound  must  apply  to  the  other  algorithm,  under  the  same  assumptions. 

A reduction of matrix multiplication to LU decomposition is straightforward given the
following identity:

$$\begin{pmatrix} I & 0 & -B \\ A & I & 0 \\ 0 & 0 & I \end{pmatrix} = \begin{pmatrix} I & & \\ A & I & \\ 0 & 0 & I \end{pmatrix} \begin{pmatrix} I & 0 & -B \\ & I & A \cdot B \\ & & I \end{pmatrix}. \tag{3.1}$$

That is, given two input matrices A and B, we can compute $A \cdot B$ by constructing the matrix
on the left hand side of the identity above, performing an LU decomposition, and then
extracting the (2, 3) block of the upper triangular output matrix. Thus, given an algorithm
for LU decomposition that communicates less than the lower bound for multiplication, we
have an algorithm for matrix multiplication that communicates less than the lower bound,
a contradiction. Note that although the dimension of the LU decomposition is three times
that of the original multiplication, the same communication bound holds in an asymptotic
sense. This reduction appears in [80]. Stated formally:
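The reduction can be exercised on a small example. The sketch below is an illustration under stated assumptions: `lu_no_pivot` is a plain textbook elimination without pivoting written for this example (not an algorithm taken from the text), and the helper names are hypothetical. It builds the 3n-by-3n matrix of Equation (3.1), factors it, and reads the product off the (2,3) block of U.

```python
import random

def lu_no_pivot(T):
    """Classical LU without pivoting; returns (L, U) as lists of rows with T = L*U."""
    n = len(T)
    U = [row[:] for row in T]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        for i in range(j + 1, n):
            L[i][j] = U[i][j] / U[j][j]
            for c in range(j, n):
                U[i][c] -= L[i][j] * U[j][c]
    return L, U

def matmul_via_lu(A, B):
    """Compute A*B by factoring the 3n-by-3n matrix of Equation (3.1)."""
    n = len(A)
    T = [[0.0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        T[i][i] = T[n + i][n + i] = T[2 * n + i][2 * n + i] = 1.0
        for j in range(n):
            T[i][2 * n + j] = -B[i][j]    # (1,3) block: -B
            T[n + i][j] = A[i][j]         # (2,1) block: A
    _, U = lu_no_pivot(T)
    return [row[2 * n:] for row in U[n:2 * n]]    # (2,3) block of U

def matmul(X, Y):
    """Reference product via the usual row-by-column sums."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

random.seed(0)
n = 3
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
C_lu, C_ref = matmul_via_lu(A, B), matmul(A, B)
max_err = max(abs(x - y) for r1, r2 in zip(C_lu, C_ref) for x, y in zip(r1, r2))
print(max_err)
```

Every pivot in this particular matrix equals 1, so the elimination never divides by zero regardless of A and B.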

Theorem 3.3. Given a fast/local memory of size $M$, the bandwidth cost lower bound for
classical LU decomposition of a dense $n \times n$ matrix is

$$W = \Omega\left(\frac{n^3}{P M^{1/2}}\right).$$

For  a  discussion  of  algorithms  attaining  this  bound,  see  Sections  7.1.4  and  8.1.4. 


3.2.2  Cholesky  Decomposition 

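A similar identity to Equation 3.1 holds for Cholesky decomposition:

$$\begin{pmatrix} I & A^T & -B \\ A & I + A \cdot A^T & 0 \\ -B^T & 0 & D \end{pmatrix} = \begin{pmatrix} I & & \\ A & I & \\ -B^T & (A \cdot B)^T & X \end{pmatrix} \cdot \begin{pmatrix} I & A^T & -B \\ & I & A \cdot B \\ & & X^T \end{pmatrix}$$

where X is the Cholesky factor of $D' = D - B^T B - B^T A^T A B$, and D can be any symmetric
matrix such that D' is positive definite.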
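As a sanity check of this identity, the following sketch assembles the left-hand side for small random A and B, choosing D so that D' = I (an assumption made purely for convenience, so that X = I), and confirms that the (3,2) block of the Cholesky factor is $(A \cdot B)^T$. The helper functions are hypothetical and written only for this illustration. Note that assembling the (2,2) block already requires forming $A \cdot A^T$.

```python
import math
import random

def cholesky(T):
    """Classical Cholesky: lower-triangular L with T = L * L^T (T must be SPD)."""
    n = len(T)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(T[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (T[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

random.seed(0)
n = 3
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
AB = matmul(A, B)
AAt = matmul(A, transpose(A))
BtB = matmul(transpose(B), B)
ABtAB = matmul(transpose(AB), AB)          # equals B^T A^T A B

# Assemble the 3n-by-3n left-hand side with D = I + B^T B + B^T A^T A B, so D' = I.
T = [[0.0] * (3 * n) for _ in range(3 * n)]
for i in range(n):
    for j in range(n):
        delta = 1.0 if i == j else 0.0
        T[i][j] = delta                                     # (1,1) block: I
        T[i][n + j] = A[j][i]                               # (1,2) block: A^T
        T[n + i][j] = A[i][j]                               # (2,1) block: A
        T[i][2 * n + j] = -B[i][j]                          # (1,3) block: -B
        T[2 * n + i][j] = -B[j][i]                          # (3,1) block: -B^T
        T[n + i][n + j] = delta + AAt[i][j]                 # (2,2) block: I + A A^T
        T[2 * n + i][2 * n + j] = delta + BtB[i][j] + ABtAB[i][j]   # (3,3) block: D

L = cholesky(T)
# The (3,2) block of L is (A*B)^T, so entry (i, j) of A*B sits at L[2n + j][n + i].
max_err = max(abs(L[2 * n + j][n + i] - AB[i][j]) for i in range(n) for j in range(n))
print(max_err)
```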

However, the reduction is not as straightforward as in the case of LU because the
matrix-multiplication-by-Cholesky algorithm would include the computation of $A \cdot A^T$, which requires
as much communication as general matrix multiplication.¹ We next show how to change the
computation so that we can avoid constructing the $I + A \cdot A^T$ term and still perform Cholesky
decomposition to obtain the product $A \cdot B$.

In addition to the real numbers $\mathbb{R}$, consider new “starred” numerical quantities, called
1* and 0*, with arithmetic properties detailed in the following tables. The quantities 1* and
0* mask any real value in addition/subtraction operations, but behave similarly to $1 \in \mathbb{R}$ and
$0 \in \mathbb{R}$ in multiplication and division operations.

Table 3.1 defines arithmetic operations with these new quantities. Consider the set
$\mathbb{R} \cup \{1^*, 0^*\}$ with the specified arithmetic operations.

• The set is commutative with respect to addition and to multiplication (by the symmetries
of the corresponding tables).

•  The  set  is  associative  with  respect  to  addition:  regardless  of  ordering  of  summation, 
the  sum  is  1*  if  one  of  the  summands  is  1*,  otherwise  it  is  0*  if  one  of  the  summands 
is  0*. 


¹ To see why, take $A = \begin{pmatrix} X & 0 \\ Y^T & 0 \end{pmatrix}$; then the (1,2) block of $A \cdot A^T$ is the general product $X \cdot Y$.

Table 3.1: Arithmetic operations with starred values. The variables x, y stand for any real
values. For consistency, −0* = 0* and −1* = 1*.

• The set is also associative with respect to multiplication: $(a \cdot b) \cdot c = a \cdot (b \cdot c)$. This
is trivial if all factors are in $\mathbb{R}$. As 1* is a multiplicative identity, it is also immediate
if some of the factors equal 1*. Otherwise, at least one of the factors is 0*, and the
product is 0.

• Distributivity, however, does not hold: $1 \cdot (1^* + 1^*) = 1 \neq 2 = (1 \cdot 1^*) + (1 \cdot 1^*)$.
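The properties listed above can be checked by brute force on a small sample of values. The following is a minimal sketch, not the thesis's implementation: the sentinel class and function names are hypothetical, and since the body of Table 3.1 is not reproduced here, the rule that any product involving 0* (other than with 1*) yields the real 0 is an assumption inferred from the prose.

```python
import itertools

class Star:
    """Sentinel for the two starred quantities; deliberately not a number."""
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

ONE_S, ZERO_S = Star("1*"), Star("0*")

def s_add(a, b):
    # 1* masks everything; otherwise 0* masks any real summand.
    if a is ONE_S or b is ONE_S:
        return ONE_S
    if a is ZERO_S or b is ZERO_S:
        return ZERO_S
    return a + b

def s_mul(a, b):
    # 1* is the multiplicative identity; 0* acts like zero (assumed to yield real 0).
    if a is ONE_S:
        return b
    if b is ONE_S:
        return a
    if a is ZERO_S or b is ZERO_S:
        return 0
    return a * b

def same(x, y):
    """Equality that compares starred values by identity and reals by value."""
    if isinstance(x, Star) or isinstance(y, Star):
        return x is y
    return x == y

samples = [ONE_S, ZERO_S, 0, 1, 2, -3]
triples = list(itertools.product(samples, repeat=3))

commutative = all(same(s_add(a, b), s_add(b, a)) and same(s_mul(a, b), s_mul(b, a))
                  for a, b, _ in triples)
add_associative = all(same(s_add(s_add(a, b), c), s_add(a, s_add(b, c)))
                      for a, b, c in triples)
mul_associative = all(same(s_mul(s_mul(a, b), c), s_mul(a, s_mul(b, c)))
                      for a, b, c in triples)

lhs = s_mul(1, s_add(ONE_S, ONE_S))              # 1 * (1* + 1*)
rhs = s_add(s_mul(1, ONE_S), s_mul(1, ONE_S))    # (1 * 1*) + (1 * 1*)
print(commutative, add_associative, mul_associative, lhs, rhs)
```

The last two values differ, which is the distributivity counterexample from the final bullet.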

Let us return to the construction. We set T to be:

$$T = \begin{pmatrix} I & A^T & -B \\ A & C & 0 \\ -B^T & 0 & C \end{pmatrix}$$

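where C has 1* on the main diagonal and 0* everywhere else:

$$C = \begin{pmatrix} 1^* & 0^* & \cdots & 0^* \\ 0^* & 1^* & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0^* \\ 0^* & \cdots & 0^* & 1^* \end{pmatrix}.$$

One can verify that the (unique) Cholesky decomposition of C is

$$C = \begin{pmatrix} 1^* & & & \\ 0^* & 1^* & & \\ \vdots & \ddots & \ddots & \\ 0^* & \cdots & 0^* & 1^* \end{pmatrix} \cdot \begin{pmatrix} 1^* & 0^* & \cdots & 0^* \\ & 1^* & \ddots & \vdots \\ & & \ddots & 0^* \\ & & & 1^* \end{pmatrix} = C' \cdot C'^T. \tag{3.2}$$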
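Note that if a matrix X does not contain any “starred” values 0* and 1* then
$X = C \cdot X = X \cdot C = C' \cdot X = X \cdot C' = C'^T \cdot X = X \cdot C'^T$ and $C + X = C$. Therefore, one can confirm
that the Cholesky decomposition of T is:

$$T = \begin{pmatrix} I & A^T & -B \\ A & C & 0 \\ -B^T & 0 & C \end{pmatrix} = \begin{pmatrix} I & & \\ A & C' & \\ -B^T & (A \cdot B)^T & C' \end{pmatrix} \cdot \begin{pmatrix} I & A^T & -B \\ & C'^T & A \cdot B \\ & & C'^T \end{pmatrix} = L \cdot L^T. \tag{3.3}$$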
One can think of C as masking the $A \cdot A^T$ previously appearing in the central block
of T, therefore allowing the lower bound of computing $A \cdot B$ to be accounted for by the
Cholesky decomposition, and not by the computation of $A \cdot A^T$. More formally, let Alg be
any classical algorithm for Cholesky factorization. We convert it to a matrix multiplication
algorithm using Algorithm 3.1.


Algorithm 3.1 Matrix Multiplication by Cholesky Decomposition
Require: Two n × n matrices, A and B
1: Let Alg' be Alg updated to correctly handle the new 0* and 1* values
       ▷ note that Alg' can be constructed off-line
2: Construct T as in Equation (3.3)
3: L = Alg'(T)
4: return $(L_{32})^T$


The  simplest  conceptual  way  to  implement  line  1  of  the  algorithm  is  to  attach  an  extra 
bit  to  every  numerical  value,  indicating  whether  it  is  “starred”  or  not,  and  modify  every 
arithmetic  operation  to  check  this  bit  before  performing  an  operation.  This  increases  the 
bandwidth  cost  by  at  most  a  constant  factor.  Alternatively,  we  can  use  Signalling  NaNs  as 
defined  in  the  IEEE  Floating  Point  Standard  [92]  to  encode  1*  and  0*  with  no  extra  bits. 

If the instructions implementing Cholesky are scheduled deterministically, there is another
alternative. One can run the algorithm “symbolically”, propagating 0* and 1* arguments
from the inputs forward, simplifying or eliminating arithmetic operations whose
inputs contain 0* or 1*. One can also eliminate operations for which there is no path in
the directed acyclic graph (describing how outputs of each operation propagate to inputs of
other operations) to the desired output $A \cdot B$. The resulting Alg' performs a strict subset of
the arithmetic and memory operations of the original Cholesky algorithm.

We  note  that  updating  Alg  to  form  Alg'  is  done  off-line,  so  that  line  1  does  not  actually 
take  any  time  to  perform  when  Algorithm  3.1  is  called. 

3.2.2.1 Correctness of Algorithm 3.1

We next verify the correctness of this reduction: that the output of this procedure on input
A and B is indeed the multiplication $A \cdot B$, as long as Alg is a classical algorithm, in a sense
we now define carefully.

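Let $T = L \cdot L^T$ be the Cholesky decomposition of T. Then we have the following formulas:

$$L(i,i) = \sqrt{T(i,i) - \sum_{k=1}^{i-1} \left(L(i,k)\right)^2} \tag{3.4}$$

$$L(i,j) = \frac{1}{L(j,j)} \left( T(i,j) - \sum_{k=1}^{j-1} L(i,k) \cdot L(j,k) \right), \quad i > j. \tag{3.5}$$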
A “classical” Cholesky decomposition algorithm computes each of these $O(n^3)$ flops, which
may be reordered using only commutativity and associativity of addition. By the no-pivoting
and no-distributivity restrictions on Alg, when an entry of L is computed, all the entries on
which it depends have already been computed and combined by the above formulas, with the
sums occurring in any order. These dependencies form a dependency graph on the entries
of L, and impose a partial ordering on the computation of the entries of L (see Figure 3.1).
That is, when an entry L(i,i) is computed, by Equation (3.4), all the entries $\{L(i,k)\}_{1 \le k < i}$
have already been computed.² Denote this set of entries by $S_{ii}$, namely,

$$S_{ii} = \{L(i,k)\}_{1 \le k < i}. \tag{3.6}$$

Similarly, when an entry L(i,j) (for i > j) is computed, by Equation (3.5), all the entries
$\{L(i,k)\}_{1 \le k < j}$ and all the entries $\{L(j,k)\}_{1 \le k \le j}$ have already been computed. Denote this
set by $S_{ij}$, namely,

$$S_{ij} = \{L(i,k)\}_{1 \le k < j} \cup \{L(j,k)\}_{1 \le k \le j}. \tag{3.7}$$


Figure 3.1: Dependencies of L(i,j), for diagonal entries (left) and other entries (right).
Dark grey represents the sets $S_{ii}$ (left) and $S_{ij}$ (right). Light grey represents indirect
dependencies.


Lemma 3.4. Any ordering of the computation of the elements of L that respects the partial
ordering induced by the computation graph results in a correct computation of $A \cdot B$.

Proof. We need to confirm that the starred entries 1* and 0* of T do not somehow
“contaminate” the desired entries of $L_{32}^T$. The proof is by induction on the partial order on pairs
(i,j) implied by (3.6) and (3.7). The base case, the correctness of computing L(1,1),
is immediate. Assume by induction that all elements of $S_{ij}$ are correctly computed and
consider the computation of L(i,j) according to the block in which it resides:

• If L(i,j) resides in block $L_{11}$, $L_{21}$ or $L_{31}$ then $S_{ij}$ contains only real values, and no
arithmetic operations with 0* or 1* occur (recall Figure 3.1 or Equations (3.3), (3.6)
and (3.7)). Therefore, the correctness follows from the correctness of the original
Cholesky algorithm.

² While this partial ordering constrains the scheduling of flops, it does not uniquely identify a computation
DAG (directed acyclic graph), for the additions within one summation can be in arbitrary order (forming
arbitrary subtrees in the computation DAG).

• If L(i,j) resides in $L_{22}$ or $L_{33}$ then $S_{ij}$ may contain “starred” values (elements of C').
We treat separately the case where L(i,j) is on the main diagonal and the case where
it is not.

If i = j then by Equation (3.4) L(i,i) is determined to be 1* since T(i,i) = 1* and
since adding to, subtracting from and taking the square root of 1* all result in 1* (recall
Table 3.1 and Equation (3.4)).

If i > j then by the inductive assumption the divisor L(j,j) of Equation (3.5) is
correctly computed to be 1* (recall Figure 3.1 and the definition of C' in Equation
(3.2)). Therefore, no division by 0* is performed. Moreover, T(i,j) is 0*. Then L(i,j)
is determined to be the correct value 0*, unless 1* is subtracted (recall Equation (3.5)).
However, every subtracted product (recall Equation (3.5)) is composed of two factors
of the same column but of different rows. Therefore, by the structure of C', none of
them is 1* so their product is not 1* and the value is computed correctly.

• If L(i,j) resides in $L_{32}$ then $S_{ij}$ may contain “starred” values (see Figure 3.1,
right-hand side, row j). However, every subtraction performed (recall Equation (3.5)) is
composed of a product of two factors, of which one is on the ith row (and on a column
k < j). Hence, by induction (on i,j), the (i,k) element has been computed correctly
to be a real value, and by the multiplication properties so is the product. Therefore
no masking occurs.

This  completes  the  proof  of  Lemma  3.4.  □ 
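The argument of Lemma 3.4 can be traced on a small instance. The sketch below follows the first implementation option (tagging values as starred) and evaluates Equations (3.4) and (3.5) column by column on T. It is an illustration under assumptions, not the thesis's code: the sentinel class, the helper names, and the choice of a column-by-column schedule are hypothetical, and because the body of Table 3.1 is not reproduced here, the rule that a product involving 0* (other than with 1*) is the real 0 is inferred from the prose.

```python
import math
import random

class Star:
    """Sentinel type for the starred quantities 1* and 0*."""
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

ONE_S, ZERO_S = Star("1*"), Star("0*")

def s_sub(a, b):
    # Starred values mask reals under subtraction; 1* dominates 0*.
    if a is ONE_S or b is ONE_S:
        return ONE_S
    if a is ZERO_S or b is ZERO_S:
        return ZERO_S
    return a - b

def s_mul(a, b):
    # 1* is the identity; a product involving 0* is assumed to be the real 0.
    if a is ONE_S:
        return b
    if b is ONE_S:
        return a
    if a is ZERO_S or b is ZERO_S:
        return 0.0
    return a * b

def s_div(a, b):
    if b is ONE_S:                       # dividing by 1* leaves the value unchanged
        return a
    if isinstance(a, Star) or isinstance(b, Star):
        raise ArithmeticError("case not exercised by this reduction")
    return a / b

def s_sqrt(a):
    return a if a is ONE_S else math.sqrt(a)

def cholesky_starred(T):
    """Equations (3.4) and (3.5), evaluated column by column with starred arithmetic."""
    n = len(T)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        acc = T[j][j]
        for k in range(j):
            acc = s_sub(acc, s_mul(L[j][k], L[j][k]))
        L[j][j] = s_sqrt(acc)
        for i in range(j + 1, n):
            acc = T[i][j]
            for k in range(j):
                acc = s_sub(acc, s_mul(L[i][k], L[j][k]))
            L[i][j] = s_div(acc, L[j][j])
    return L

def matmul_via_cholesky(A, B):
    """Build T, factor it, and return ((L_32)^T, L)."""
    n = len(A)
    T = [[0.0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        for j in range(n):
            T[i][j] = 1.0 if i == j else 0.0        # (1,1) block: I
            T[i][n + j] = A[j][i]                   # (1,2) block: A^T
            T[n + i][j] = A[i][j]                   # (2,1) block: A
            T[i][2 * n + j] = -B[i][j]              # (1,3) block: -B
            T[2 * n + i][j] = -B[j][i]              # (3,1) block: -B^T
            star = ONE_S if i == j else ZERO_S
            T[n + i][n + j] = star                  # (2,2) block: C
            T[2 * n + i][2 * n + j] = star          # (3,3) block: C
    L = cholesky_starred(T)
    return [[L[2 * n + j][n + i] for j in range(n)] for i in range(n)], L

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

random.seed(0)
n = 3
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
product, L = matmul_via_cholesky(A, B)
reference = matmul(A, B)
max_err = max(abs(product[i][j] - reference[i][j]) for i in range(n) for j in range(n))
print(max_err)
```

In this run the diagonal blocks of L keep their starred entries while the (3,2) block stays real, which mirrors the three cases of the proof.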

3.2.2.2 Lower Bound Argument

We now know that Algorithm 3.1 correctly multiplies matrices “classically”, and so has
known communication lower bounds given by Theorems 3.1 and 3.2. It remains to confirm
that Step 2 (setting up T) and Step 4 (returning $L_{32}^T$) do not require much communication, so
that these lower bounds apply to Step 3, running Alg' (recall that Step 1 may be performed
off-line and so doesn’t count). Since Alg' is either a small modification of Cholesky to add
“star” labels to all data items (at most doubling the bandwidth cost), or a subset of Cholesky
with some operations omitted (those with starred arguments, or not leading to the desired
output $L_{32}$), a lower bound on communication for Alg' is also a lower bound for Cholesky.

That is, any sequential or parallel classical algorithm for the Cholesky decomposition of
n-by-n matrices can be transformed into a classical algorithm for (n/3)-by-(n/3) matrix multiplication,
in such a way that the bandwidth cost of the matrix-multiplication algorithm is at most a
constant times the bandwidth cost of the Cholesky algorithm. Therefore any bandwidth
or latency cost lower bound for classical matrix multiplication applies to classical Cholesky,
asymptotically speaking:

Theorem 3.5. Given a fast/local memory of size $M$, the bandwidth cost lower bound for
classical Cholesky decomposition of a dense symmetric $n \times n$ matrix is

$$W = \Omega\left(\frac{n^3}{P M^{1/2}}\right).$$

Proof. Constructing T (in any data format) requires bandwidth of at most $18n^2$ (copying a
3n-by-3n matrix, with another factor of 2 if each entry has a flag indicating whether it is
“starred” or not), and extracting $L_{32}^T$ requires another $n^2$ of bandwidth. Furthermore, we
can assume $n^2 < n^3/M^{1/2}$, i.e., that $M < n^2$, i.e., that the matrix is too large to fit entirely
in fast memory (the only case of interest). Thus the bandwidth lower bound $\Omega(n^3/M^{1/2})$ of
Algorithm 3.1 dominates the bandwidth costs of Steps 2 and 4, and so must apply to Step
3 (Cholesky). Finally, as each message delivers at most M words, the latency lower bound
for Step 3 is by a factor of M smaller than its bandwidth cost lower bound, as desired.

The argument in the parallel case is analogous. The construction of input and retrieval
of output at Steps 2 and 4 of Algorithm 3.1 contribute bandwidth of $O(n^2/P)$. Therefore
the lower bound of the bandwidth $\Omega(n^3/(P M^{1/2}))$ is determined by Step 3, the Cholesky
decomposition. The lower bound on the latency of Step 3 is therefore $\Omega(n^3/(P M^{3/2}))$, as
each message delivers at most M words. □

For  algorithms  attaining  this  bound,  see  Sections  7.1.2  and  8.1.2. 


3.3  Conclusions 

In this chapter we establish the fact that LU and Cholesky decompositions are as hard as
matrix multiplication, in terms of their communication requirements. In Chapter 4, we
reproduce these results with a different, direct proof and show that the new proof can be
applied to many other computations. In fact, the later results generalize those here because
they also apply to sparse and rectangular (in the case of LU) matrices, while Theorems 3.3
and 3.5 assume dense, square matrices. We include the results of this chapter because (1)
they predate those of Chapter 4 and (2) the proof technique may prove valuable for other
computations. For instance, the best known lower bound results for QR decomposition (see
Section 4.3) require technical assumptions; a reduction from matrix multiplication to QR
could make the results more robust. In addition, there exists an incomplete reduction in [17,
Section 4.1] of computing a Schur decomposition to computing a QR decomposition. Note
also that the results here assume “classical” decompositions (see Definition 2.1); algorithms
that compute the decompositions with the help of Strassen’s or other fast matrix multiplication
algorithms can communicate less than algorithms for classical matrix multiplication.
Chapter 4

Lower  Bounds  for  Classical  Linear 
Algebra 


While  Chapter  3  extends  the  lower  bounds  for  matrix  multiplication  (given  in  Section  3.1)  via 
reduction  arguments,  this  chapter  presents  a  much  more  general  set  of  arguments  to  establish 
lower  bounds  for  nearly  all  of  classical  linear  algebra.  In  particular,  this  approach  makes 
no  assumption  on  the  sparsity  of  the  matrices  involved  in  the  computation,  so  it  generalizes 
all  of  the  results  in  the  previous  chapter.  Intuitively,  matrix  multiplication  is  the  most 
fundamental  computation  in  numerical  linear  algebra — in  fact,  it  is  used  as  a  subroutine  in 
nearly  every  other  algorithm  in  linear  algebra — so  it  seems  no  surprise  that  any  lower  bound 
for  matrix  multiplication  would  also  apply  to  many  other  computations.  The  goal  of  this 
section  is  to  confirm  that  suspicion  rigorously. 

The key observation, and the basis for the arguments in this section, is that the proof
technique of Irony, Toledo, and Tiskin [95] can be applied more generally than just to dense
matrix multiplication. The geometric argument is based on the lattice of indices (i,j,k)
which corresponds to the updates $C_{ij} := C_{ij} + A_{ik} \cdot B_{kj}$. However, the proof does not depend
on, for example, the scalar operations being multiplication and addition, the matrices being
dense, or the input and output matrices being distinct. The important property is the
relationship between the indices (i,j,k) which allows for the embedding of the computation
in three dimensions. These observations let us state and prove a more general set of theorems
and corollaries that provide a lower bound on the number of words moved into or out of a
fast or local memory of size M: for a large class of classical linear algebra algorithms,

$$W = \Omega(G/M^{1/2}),$$

where G is proportional to the total number of flops performed by the processor. In other
words, a computation executed using a classical linear algebra algorithm requires at least
$\Omega(1/\sqrt{M})$ memory operations for every arithmetic operation, or conversely, the maximum
amount of re-use for any word read into fast or local memory during such a computation is
$O(\sqrt{M})$ arithmetic operations.
The main contributions of this chapter are lower bound results for dense or sparse matrices,
on sequential and parallel machines, for the following computations:

• Basic Linear Algebra Subroutines (BLAS), including matrix multiplication and solving
triangular systems;

• LU, Cholesky, $LDL^T$, $LTL^T$ factorizations, including incomplete versions;

• QR factorization, including approaches based on solving the normal equations,
Gram-Schmidt orthogonalization, or applying orthogonal transformations;

• eigenvalue and singular value reductions via orthogonal transformations and computing
eigenvectors from Schur form; and

• all-pairs shortest paths computation based on the Floyd-Warshall approach.

This chapter is organized as follows. Section 4.1 generalizes the argument of [95] slightly
and demonstrates that it applies to several other fundamental computations, including LU
and Cholesky decompositions. In Section 4.2, we address a set of computations which violate
certain assumptions in the argument of Section 4.1 (e.g., $LDL^T$ decomposition) and show
how to obtain asymptotically equivalent results. Section 4.3 considers computations involving
orthogonal transformations, where the lower bound arguments require more complicated
analysis and some further assumptions.

All of the results in this chapter (with the exceptions of Sections 4.1.2.4 and 4.2.2.4)
appear in [28], written with coauthors James Demmel, Olga Holtz, and Oded Schwartz,
though the presentation of the material has been reorganized into the three sections described
above. The paper was awarded the SIAM Linear Algebra Prize in 2011.

4.1  Lower  Bounds  for  Three-Nested-Loops 
Computation 

Recall  the  simplest  pseudocode  for  multiplying  n  x  n  matrices,  as  three  nested  loops: 

    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                C[i,j] = C[i,j] + A[i,k] * B[k,j]

While  the  pseudocode  above  specifies  a  particular  order  on  the  n3  inner  loop  iterations,  any 
complete  traversal  of  the  index  space  yields  a  correct  computation  (and  all  orderings  generate 
equivalent  results  in  exact  arithmetic).  As  we  will  see,  nearly  all  of  classical  linear  algebra 
can  be  expressed  in  a  similar  way — with  three  nested  loops. 
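The claim that any complete traversal of the index space gives the same result in exact arithmetic can be checked directly. The following sketch uses Python's `fractions.Fraction` for exact rational arithmetic and compares the nested-loop order against a random permutation of the same $n^3$ index triples; the function name and the random test data are illustrative assumptions.

```python
import itertools
import random
from fractions import Fraction

def matmul_in_order(A, B, order):
    """Accumulate C[i][j] += A[i][k] * B[k][j] over the index triples in the given order."""
    n = len(A)
    C = [[Fraction(0)] * n for _ in range(n)]
    for i, j, k in order:
        C[i][j] += A[i][k] * B[k][j]
    return C

random.seed(1)
n = 4
A = [[Fraction(random.randint(-9, 9), random.randint(1, 9)) for _ in range(n)]
     for _ in range(n)]
B = [[Fraction(random.randint(-9, 9), random.randint(1, 9)) for _ in range(n)]
     for _ in range(n)]

loop_order = list(itertools.product(range(n), repeat=3))   # i, j, k nested loops
shuffled_order = loop_order[:]
random.shuffle(shuffled_order)                              # arbitrary traversal

C_loops = matmul_in_order(A, B, loop_order)
C_shuffled = matmul_in_order(A, B, shuffled_order)
print(C_loops == C_shuffled)
```

With floating-point entries the two results would generally agree only up to rounding, which is why exact rationals are used here.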
Note that the matrix multiplication computation can be specified more generally in
mathematical notation:

$$C_{ij} = \sum_k A_{ik} B_{kj},$$

where the order of summation and order of computation of output entries are left undefined.
In this chapter, we specify computations in this general way, but we will use the term
“three-nested-loops” to refer to computations that can be expressed with pseudocode similar to that
of matrix multiplication above.

4.1.1  Lower  Bound  Argument 

We first define our model of computation formally, and illustrate it on the case of matrix
multiplication: $C = C + A \cdot B$. Let $S_a \subseteq \{1, 2, \ldots, n\} \times \{1, 2, \ldots, n\}$, corresponding in matrix
multiplication to the subset of indices of the entries of the input matrix A that are accessed
by the algorithm (e.g., the indices of the nonzero entries of a sparse matrix). Let $\mathcal{M}$ be
the set of locations in slow/global memory (on a parallel machine $\mathcal{M}$ refers to a location in
some processor’s memory; the processor number is implicit). Let $a : S_a \to \mathcal{M}$ be a mapping
from the indices to memory, and similarly define $S_b$, $S_c$ and $b(\cdot,\cdot)$, $c(\cdot,\cdot)$, corresponding to
the matrices B and C. The value of a memory location $l \in \mathcal{M}$ is denoted by Mem(l). We
assume that the values are independent, i.e., determining any value requires accessing the
memory location.

Definition 4.1 (3NL Computation). A computation is considered to be three-nested-loops
(3NL) if it includes computing, for all $(i,j) \in S_c$ with $S_{ij} \subseteq \{1, 2, \ldots, n\}$,

$$\mathrm{Mem}(c(i,j)) = f_{ij}\left( \{ g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j))) \}_{k \in S_{ij}} \right)$$

where

(a) mappings a, b, and c are all one-to-one into slow/global memory, and

(b) functions $f_{ij}$ and $g_{ijk}$ depend non-trivially on their arguments.

Further, define a 3NL operation as an evaluation of a $g_{ijk}$ function, and let G be the number
of unique 3NL operations performed:

$$G = \sum_{(i,j) \in S_c} |S_{ij}|.$$

Note  that  while  each  mapping  a,  b  and  c  must  be  one-to-one,  the  ranges  are  not  required 
to  be  disjoint.  For  example,  if  we  are  computing  the  square  of  a  matrix,  then  A  =  B  and 
a  =  b,  and  the  computation  is  still  3NL. 

By requiring that the functions $f_{ij}$ and $g_{ijk}$ depend “non-trivially” on their arguments,
we mean the following: we need at least one word of space to compute $f_{ij}$ (which may or
may not be Mem(c(i,j))) to act as “accumulator” of the value of $f_{ij}$, and we need the values
Mem(a(i,k)) and Mem(b(k,j)) to be in fast or local memory before evaluating $g_{ijk}$. Note
that $f_{ij}$ and $g_{ijk}$ may depend on other arguments, but we do not require that the functions
depend non-trivially on them.

Note also that we may not know until after the computation what $S_c$, $f_{ij}$, $S_{ij}$, or $g_{ijk}$
were, since they may be determined on the fly. For example, in the case of sparse matrix
multiplication, the sparsity pattern of the output matrix C may not be known at the start
of the algorithm. There may even be branches in the code based on random numbers, or
in the case of LU decomposition, pivoting decisions are made through the course of the
computation.

Now we illustrate the notation in Definition 4.1 for the case of sequential dense $n \times n$
matrix multiplication $C = C + A \cdot B$, where A, B and C are stored in column-major layout
in slow memory. We take $S_c$ as all pairs (i,j) with $1 \le i,j \le n$ with output entry $C_{ij}$
stored in location $c(i,j) = C + (i-1) + (j-1) \cdot n$, where C is some memory location. Input
matrix entry $A_{ik}$ is analogously stored at location $a(i,k) = A + (i-1) + (k-1) \cdot n$ and $B_{kj}$
is stored at location $b(k,j) = B + (k-1) + (j-1) \cdot n$, where A and B are offsets chosen
so that none of the matrices overlap. The set $S_{ij} = \{1, 2, \ldots, n\}$ for all (i,j). Operation
$g_{ijk}$ is scalar multiplication, and $f_{ij}$ computes the sum of its n arguments. Thus, $G = n^3$.
In the case of parallel matrix multiplication, a single processor will perform only a subset
of the computation. In this case, for a given processor, the sizes of the sets $S_c$ and $S_{ij}$
may be smaller than $n^2$ and n, respectively, and G will become $n^3/P$ if the computation is
load-balanced.
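The column-major illustration can be written out concretely. In the sketch below the base offsets `A_BASE`, `B_BASE`, and `C_BASE` are hypothetical values chosen so that the three matrices do not overlap, and `column_major` is an illustrative helper; the code checks that each mapping is one-to-one and counts the 3NL operations.

```python
def column_major(base, n):
    """Address map (r, s) -> base + (r - 1) + (s - 1) * n for 1-based indices."""
    return lambda r, s: base + (r - 1) + (s - 1) * n

n = 4
A_BASE, B_BASE, C_BASE = 0, n * n, 2 * n * n     # hypothetical non-overlapping offsets
a, b, c = (column_major(base, n) for base in (A_BASE, B_BASE, C_BASE))

index_pairs = [(r, s) for r in range(1, n + 1) for s in range(1, n + 1)]
S_c = index_pairs                                          # every output entry is computed
S = {(i, j): set(range(1, n + 1)) for (i, j) in S_c}       # S_ij = {1, ..., n}

locs_a = {a(i, k) for i, k in index_pairs}
locs_b = {b(k, j) for k, j in index_pairs}
locs_c = {c(i, j) for i, j in index_pairs}

G = sum(len(S[ij]) for ij in S_c)        # number of 3NL operations
print(G)
```

For a sparse computation the same count applies with smaller sets `S_c` and `S[(i, j)]`, and a load-balanced parallel run would assign roughly G / P of these operations to each processor.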

We now state and prove the communication lower bound for 3NL computation.

Theorem 4.2. The bandwidth cost lower bound of a 3NL computation (Definition 4.1) is

$$W \ge \frac{G}{8\sqrt{M}} - M$$

where M is the size of the fast/local memory.

Proof. Following [95], we consider any implementation of the computation as a stream of
instructions involving computations and memory operations: loads and stores from and to
slow/global memory. The argument is as follows:

• Break the stream of instructions executed into segments, where each segment contains
exactly M load and store instructions (i.e., that cause communication), where M is
the fast (or local) memory size.

• Bound from above the number of 3NL operations that can be performed during any
given segment, calling this upper bound F.

• Bound from below the number of (complete) segments by the total number of 3NL
operations divided by F, i.e., $\lfloor G/F \rfloor$.

• Bound from below the total number of loads and stores, by M (load/stores per segment)
times the minimum number of complete segments, $\lfloor G/F \rfloor$, so it is at least $M \cdot \lfloor G/F \rfloor$.

Because functions $f_{ij}$ and $g_{ijk}$ depend non-trivially on their arguments, an evaluation of
a $g_{ijk}$ function requires that the two input operands must be resident in fast memory and
the output operand (which may be an accumulator) must either continue to reside in fast
memory or be written to slow/global memory (it cannot be discarded).

For a given segment, we can bound the number of input and output operands that
are available in fast/local memory in terms of the memory size M. Consider the values
Mem(c(i,j)): for each (i,j), at least one accumulator must reside in fast memory during
the segment; since there are at most M words in fast memory at the end of a segment
and at most M store operations, there can be no more than 2M distinct accumulators. Now
consider the values Mem(a(i,k)): at the start of the segment, there can be at most M distinct
operands resident in fast memory since a is one-to-one; during the segment, there can be
at most M additional operands read into fast memory since a segment contains exactly M
memory operations. If the range of a overlaps with the range of c, then there may be values
Mem(a(i,k)) which were computed as Mem(c(i,j)) values during the segment. Since there
are at most 2M such operands, the total number of Mem(a(i,k)) values available during
a segment is at most 4M. The same argument holds for Mem(b(k,j)) independently. Thus, the
number of each type of operand available during a given segment is at most 4M. Note that
this constant can be improved in particular cases, for example when the ranges of a, b, and
c do not overlap.

Now we compute the upper bound F using the geometric result of Loomis and Whitney
[107], a simpler form of which is given as Lemma 2.7. Let the set of lattice points (i,j,k)
represent each function evaluation $g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))$. For a given segment,
let V be the set of indices (i,j,k) of the $g_{ijk}$ operations performed during the segment, $V_z$
be the set of indices (i,j) of their destinations c(i,j), $V_y$ be the set of indices (i,k) of their
arguments a(i,k), and $V_x$ be the set of indices (k,j) of their arguments b(k,j). Then by
Lemma 2.7,

$$|V| \le \sqrt{|V_x| \cdot |V_y| \cdot |V_z|} \le \sqrt{(4M)^3} = F.$$

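Therefore the total number of loads and stores over all segments is bounded below by

$$M \left\lfloor \frac{G}{F} \right\rfloor = M \left\lfloor \frac{G}{\sqrt{(4M)^3}} \right\rfloor \ge \frac{G}{8\sqrt{M}} - M.$$

□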
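The final step of the proof uses only $\lfloor x \rfloor \ge x - 1$ together with $F = \sqrt{(4M)^3} = 8M^{3/2}$. The sketch below evaluates both the segment-counting expression and the simplified bound; the function names and the sample values of n and M are hypothetical.

```python
import math

def segment_lower_bound(G, M):
    """M * floor(G / F) with F = sqrt((4M)^3), as in the proof's counting argument."""
    F = math.sqrt((4 * M) ** 3)
    return M * math.floor(G / F)

def closed_form(G, M):
    """The simplified bound G / (8 sqrt(M)) - M from Theorem 4.2."""
    return G / (8 * math.sqrt(M)) - M

n, M = 4096, 2 ** 20        # hypothetical problem size and fast-memory size (in words)
G = n ** 3                  # dense n-by-n matrix multiplication
print(segment_lower_bound(G, M), closed_form(G, M))
```

The simplified form is never larger than the segment count times M, so it remains a valid (slightly weaker) lower bound that is easier to state.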

Note that the proof of Theorem 4.2 applies to any ordering of the $g_{ijk}$ operations. In the
case of matrix multiplication, there are no dependencies between $g_{ijk}$ operations, so every
ordering will compute the correct answer. However, for most other computations, there
are many dependencies that must be respected for correct computation. This lower bound
argument thus applies not only to correct algorithms, but also to incorrect ones, as long as
the computation satisfies the conditions of Definition 4.1.


Figure 4.1: Geometric model of matrix multiplication.

To see the relationship of the geometric inequality given by Lemma 2.7 to 3NL computations
(Definition 4.1), see Figure 4.1, shown for the special case of 3 × 3 matrix multiplication.
We model the computation as an $n \times n \times n$ set of lattice points, drawn as a set of $n^3$ $1 \times 1 \times 1$
cubes for easier labeling: each $1 \times 1 \times 1$ cube represents the lattice point at its bottom
front right corner. The cubes (or lattice points) are indexed from corner (i,j,k) = (0,0,0)
to (n−1, n−1, n−1). Cube (i,j,k) represents the multiplication $A(i,k) \cdot B(k,j)$ and its
accumulation into C(i,j). The $1 \times 1$ squares on the top face of the cube, indexed by (i,j),
represent C(i,j), and the $1 \times 1$ squares on the other two faces represent A(i,k) and B(k,j),
respectively. The set of all multiplications performed during a segment are some subset ($V$
in Lemma 2.7) of all the cubes. All the C(i,j) needed to store the results are the projections
of these cubes onto the “C-face” of the cube ($V_z$ in Lemma 2.7). Similarly, the A(i,k) needed
as arguments are the projections onto the “A-face” ($V_y$ in Lemma 2.7), and the B(k,j) are
the projections onto the “B-face” ($V_x$ in Lemma 2.7).
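The projection picture can be tested on random subsets of the 3 × 3 × 3 cube. The sketch below is illustrative only: it computes the three face projections of a set V of lattice points and checks the Loomis-Whitney inequality in squared form, $|V|^2 \le |V_x| \cdot |V_y| \cdot |V_z|$, to stay in integer arithmetic. The function names and the random sampling scheme are assumptions made for this example.

```python
import itertools
import random

def projections(V):
    """Project lattice points (i, j, k) onto the three faces of the cube."""
    V_z = {(i, j) for i, j, k in V}    # C-face: entries C(i,j) updated
    V_y = {(i, k) for i, j, k in V}    # A-face: entries A(i,k) read
    V_x = {(k, j) for i, j, k in V}    # B-face: entries B(k,j) read
    return V_x, V_y, V_z

def loomis_whitney_holds(V):
    """Check |V|^2 <= |V_x| * |V_y| * |V_z| (squared form avoids square roots)."""
    V_x, V_y, V_z = projections(V)
    return len(V) ** 2 <= len(V_x) * len(V_y) * len(V_z)

n = 3
cube = list(itertools.product(range(n), repeat=3))
random.seed(2)
subsets = [random.sample(cube, size)
           for size in range(1, len(cube) + 1) for _ in range(20)]

all_hold = all(loomis_whitney_holds(V) for V in subsets)
full_cube_tight = len(cube) ** 2 == (n * n) ** 3     # equality for the whole cube
print(all_hold, full_cube_tight)
```

Equality for the full cube shows that a cubical block of operations makes the best use of a given number of operands, which is the intuition behind blocked algorithms.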

4.1.2  Applications  of  the  Lower  Bound 

We now show how Theorem 4.2 applies to a variety of classical computations for numerical
linear algebra, by which we mean algorithms that would cost $O(n^3)$ arithmetic operations
when applied to dense $n$-by-$n$ matrices, as opposed to Strassen-like algorithms.

4.1.2.1 BLAS

We  begin  with  matrix  multiplication,  on  which  we  base  Definition  4.1.  The  proof  is  implicit 
in  the  illustration  of  the  definition  with  matrix  multiplication  in  Section  4.1.1. 

Corollary 4.3. The bandwidth cost lower bound for classical (dense or sparse) matrix
multiplication is $G/(8\sqrt{M}) - M$, where $G$ is the number of scalar multiplications performed.
In the special case of multiplying a dense $m \times k$ matrix times a dense $k \times n$ matrix, this
lower bound is $mkn/(8\sqrt{M}) - M$.

This  reproduces  Theorem  3.2  from  [95]  (with  a  different  constant)  for  the  case  of  two 
distinct,  dense  matrices,  though  we  need  no  such  assumptions.  We  note  that  this  result  could 
have been stated for sparse $A$ and $B$ in [88]: combine their Theorem 6.1 (their $\Omega(|E|)$ is the
number  of  scalar  multiplications)  with  their  Lemma  6.1  (whose  proof  does  not  require  A  and 
B  to  be  dense).  For  algorithms  attaining  this  bound  in  the  dense  case,  see  Sections  7.1.1, 
8.1.1,  and  8.2.1.  For  further  discussion  of  this  bound  in  the  sparse  case,  see  [16,  Section  3]. 
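
To get a feel for the magnitude of the bound in the dense case, the expression in Corollary 4.3 can be evaluated directly. This is only an illustrative sketch; the function name `matmul_bandwidth_lower_bound` and the sample sizes are assumptions made for this example.

```python
from math import sqrt

def matmul_bandwidth_lower_bound(m, k, n, M):
    """Evaluate G/(8*sqrt(M)) - M with G = m*k*n (dense case of Corollary 4.3).

    M is the fast-memory size in words; the result is a lower bound on words moved.
    """
    G = m * k * n                      # scalar multiplications in classical matmul
    return G / (8 * sqrt(M)) - M

# Square matrices of order 1024 with a fast memory of 4096 words.
words = matmul_bandwidth_lower_bound(1024, 1024, 1024, 4096)
```

With these sizes the bound evaluates to 2,093,056 words. For problems that fit entirely in fast memory the expression is negative, in which case the bound is vacuous.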

We  next  extend  Theorem  4.2  beyond  matrix  multiplication.  The  simplest  extension  is  to 
the  so-called  Level-3  BLAS  (Basic  Linear  Algebra  Subroutines  [45]),  which  include  related 
operations  like  multiplication  by  (conjugate)  transposed  matrices,  by  triangular  matrices 
and  by  symmetric  (or  Hermitian)  matrices.  Corollary  4.3  applies  to  these  operations  without 
change (in the case of $A^T \cdot A$ we use the fact that Theorem 4.2 makes no assumptions about
the  matrices  being  multiplied  not  overlapping). 

More interesting is the Level-3 BLAS operation for solving a triangular system with
multiple right hand sides (TRSM), computing for example $C = A^{-1}B$ where $A$ is triangular.
The classical dense computation (when $A$ is upper triangular) is specified by

$$C_{ij} = \Big( B_{ij} - \sum_{k=i+1}^{n} A_{ik} \cdot C_{kj} \Big) / A_{ii} \tag{4.1}$$

which can be executed in any order with respect to $j$, but only in decreasing order with
respect to $i$.
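
Equation (4.1) translates directly into three nested loops. The sketch below is a plain, unblocked reading of the formula (not one of the communication-optimal algorithms discussed later); the function name `trsm_upper` and the counter `G` are introduced here for illustration, and the code assumes $A$ has nonzero diagonal entries.

```python
def trsm_upper(A, B):
    """Solve A C = B for upper triangular A following Equation (4.1).

    A is n x n, B is n x m (lists of lists). Returns (C, G), where G counts the
    scalar multiplications A[i][k] * C[k][j], i.e. the g_ijk operations.
    """
    n, m = len(A), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    G = 0
    for i in range(n - 1, -1, -1):       # rows must be processed in decreasing i
        for j in range(m):               # columns may be visited in any order
            s = B[i][j]
            for k in range(i + 1, n):
                s -= A[i][k] * C[k][j]   # uses rows k > i, which are already final
                G += 1
            C[i][j] = s / A[i][i]
    return C, G

C, G = trsm_upper([[2.0, 1.0], [0.0, 4.0]], [[4.0, 6.0], [8.0, 4.0]])
```

In the dense case the counter comes out to $G = m \cdot n(n-1)/2$, consistent with the $\Theta(mn^2)$ count used in the dense case.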

Corollary 4.4. The bandwidth cost lower bound for classical (dense or sparse) TRSM is
$G/(8\sqrt{M}) - M$, where $G$ is the number of scalar multiplications performed. In the special
case of solving a dense triangular $n \times n$ system with $m$ right hand sides, this lower bound is
$\Omega(mn^2/\sqrt{M})$.

Proof. We need only verify that TRSM is a 3NL computation. We let $f_{ij}$ be the function
defined in Equation (4.1) (or similarly for lower triangular matrices or other variants). Then
we make the correspondences that $C_{ij}$ is stored at location $c(i,j) = b(i,j)$, $A_{ik}$ is stored
at location $a(i,k)$, and $g_{ijk}$ multiplies $A_{ik} \cdot C_{kj}$. Since $A$ is an input stored in slow/global
memory and $C$ is the output of the operation and must be written to slow/global memory,
the mappings $a$, $b$, and $c$ are all one-to-one into slow/global memory. Note that $c = b$ does
not prevent the computation from being 3NL. Further, functions $f_{ij}$ (involving a summation
of $g_{ijk}$ outputs) and $g_{ijk}$ (scalar multiplication) depend non-trivially on their arguments.
Thus, the computation is 3NL.

In the case of dense $n \times n$ triangular $A$ and dense $n \times m$ $B$, the number of scalar
multiplications is $G = \Theta(mn^2)$. □

See  Sections  7.1.1,  8.1.1,  and  8.2.2  for  discussions  of  algorithms  attaining  this  bound  for 
dense  matrices. 

Given a lower bound for TRSM, we can obtain lower bounds for other computations for
which TRSM is a subroutine. For example, given an $m \times n$ matrix $A$ ($m > n$), the Cholesky-QR
algorithm consists of forming $A^T A$ and computing the Cholesky decomposition of that
$n \times n$ matrix. The $R$ factor is the upper triangular Cholesky factor and, if desired, $Q$ is
obtained by solving the equation $Q = AR^{-1}$ using TRSM. Note that entries of $R$ and $Q$
are outputs of the computation, so both are mapped into slow/global memory. The
communication lower bounds for TRSM thus apply to the Cholesky-QR algorithm (and reflect
a constant fraction of the total number of multiplications of the overall dense algorithm).

We note that Theorem 4.2 also applies to the Level 2 BLAS (like matrix-vector multiplication)
and Level 1 BLAS (like dot products), though the lower bound is not attainable. In
those  cases,  the  number  of  words  required  to  access  each  of  the  input  entries  once  already 
exceeds  the  lower  bound  of  Theorem  4.2. 

4.1.2.2 LU factorization

Independent of sparsity and pivot order, the classical LU factorization (with $L$ unit lower
triangular) is specified by

$$\begin{aligned}
L_{ij} &= \Big( A_{ij} - \sum_{k<j} L_{ik} \cdot U_{kj} \Big) / U_{jj} && \text{for } i > j \\
U_{ij} &= A_{ij} - \sum_{k<i} L_{ik} \cdot U_{kj} && \text{for } i \le j.
\end{aligned} \tag{4.2}$$

In the sparse case, the equations may be evaluated for some subset of indices $(i,j)$ and the
summations may be over some subset of the indices $k$. Equation (4.2) also assumes pivoting
has already been incorporated in the interpretation of the indices $i$, $j$, and $k$. Note that
since the sets of input and output operands overlap, there are data dependencies which must
be respected for correct computation.

Corollary 4.5. The bandwidth cost lower bound for classical (dense or sparse) LU
factorization is $G/(8\sqrt{M}) - M$, where $G$ is the number of scalar multiplications performed. In the
special case of factoring a dense $m \times n$ matrix with $m > n$, this lower bound is $\Omega(mn^2/\sqrt{M})$.

Proof. We need only verify that LU factorization is a 3NL computation. We let $f_{ij}$ be the
(piecewise) function defined in Equation (4.2). Then we make the correspondences that $L_{ij}$
and $U_{ij}$ are stored at location $a(i,j) = b(i,j) = c(i,j)$. Note that while $a = b$, the sets $S_a$
and $S_b$ do not overlap, as they access lower and upper triangles, respectively, though $S_c$ does
overlap both $S_a$ and $S_b$. Since $L$ and $U$ are outputs of the operation and must be written to
slow/global memory, the mappings $a$, $b$, and $c$ are all one-to-one into slow/global memory.
Further, functions $f_{ij}$ (involving a summation of $g_{ijk}$ outputs) and $g_{ijk}$ (scalar multiplication)
depend non-trivially on their arguments. Thus, the computation is 3NL.

In the case of dense $m \times n$ LU factorization where $m > n$, the number of scalar
multiplications is $G = \Theta(mn^2)$. □
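
As a concrete reading of Equation (4.2), the following sketch evaluates the two cases column by column and counts the $g_{ijk}$ multiplications. It is an unblocked illustration under the assumption that no pivoting is needed (all $U_{jj} \ne 0$); the name `lu_classical` and the column-oriented loop order are choices made for this example, one of many orderings that respect the dependencies.

```python
def lu_classical(A):
    """Unpivoted LU of an m x n matrix (m >= n) following Equation (4.2).

    Returns (L, U, G): L is m x n unit lower triangular, U is n x n upper
    triangular, and G counts the scalar multiplications L[i][k] * U[k][j].
    """
    m, n = len(A), len(A[0])
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(m)]
    U = [[0.0] * n for _ in range(n)]
    G = 0
    for j in range(n):
        for i in range(j + 1):               # U_ij for i <= j (sum over k < i)
            s = A[i][j]
            for k in range(i):
                s -= L[i][k] * U[k][j]
                G += 1
            U[i][j] = s
        for i in range(j + 1, m):            # L_ij for i > j (sum over k < j)
            s = A[i][j]
            for k in range(j):
                s -= L[i][k] * U[k][j]
                G += 1
            L[i][j] = s / U[j][j]
    return L, U, G

L, U, G = lu_classical([[4.0, 3.0], [6.0, 3.0]])
```

For this $2 \times 2$ input the factors are $L_{21} = 1.5$, $U_{22} = -1.5$, and a single multiplication is counted.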

Note  that  Corollary  4.5  reproduces  the  result  from  the  reduction  argument  in  Section 
3.2.1  for  dense  and  square  factorization.  However,  this  corollary  is  a  strict  generalization, 
as  it  also  applies  to  sparse  and  rectangular  factorizations.  For  a  discussion  of  algorithms 
attaining  this  bound  for  dense  matrices,  see  Sections  7.1.4  and  8.1.4. 

Consider  incomplete  LU  (ILU)  factorization  [127],  where  some  entries  of  L  and  U  are 
omitted in order to speed up the computation. In the case of level-based incomplete factorizations
(i.e., ILU(p)), Corollary 4.5 applies with $G$ corresponding to the scalar multiplications
performed.  However,  consider  threshold-based  ILU,  which  computes  a  possible  nonzero  entry 
Lij  or  Uij  and  compares  it  to  a  threshold,  storing  it  only  if  it  is  larger  than  the  threshold 
and  discarding  it  otherwise.  Does  Corollary  4.5  apply  to  this  computation? 

Because a computed entry $L_{ij}$ may be discarded, the assumption that $f_{ij}$ depends
non-trivially on its arguments is violated. However, if we restrict the count of scalar
multiplications to the subset of $S_c$ for which output entries are not discarded, then all the assumptions
of 3NL are met, and the lower bound applies (with $G$ computed based on the subset). This
count may underestimate the computation by more than a constant factor (if nearly all
computed values fall beneath the threshold), but the lower bound will be valid nonetheless.
We consider another technique to arrive at a lower bound for threshold-based incomplete
factorizations in Section 4.2.2.2.

4.1.2.3 Cholesky Factorization

Independent of sparsity and (diagonal) pivot order, the classical Cholesky factorization is
specified by

$$\begin{aligned}
L_{jj} &= \Big( A_{jj} - \sum_{k<j} L_{jk}^2 \Big)^{1/2} \\
L_{ij} &= \Big( A_{ij} - \sum_{k<j} L_{ik} \cdot L_{jk} \Big) / L_{jj} && \text{for } i > j.
\end{aligned} \tag{4.3}$$

In the sparse case, the equations may be evaluated for some subset of indices $(i,j)$ and the
summations may be over some subset of the indices $k$. Equation (4.3) also assumes pivoting
has already been incorporated in the interpretation of the indices $i$, $j$, and $k$. As in the
case of LU factorization, there are data dependencies which must be respected for correct
computation.
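
A direct evaluation of Equation (4.3) makes the multiplication count explicit. The sketch below is an unblocked, left-looking illustration for a dense symmetric positive definite input; the function name `cholesky_classical` is hypothetical, and squaring $L_{jk}$ is counted as one multiplication (an accounting convention chosen for this example).

```python
from math import sqrt

def cholesky_classical(A):
    """Dense Cholesky factorization A = L L^T following Equation (4.3).

    Returns (L, G), where G counts the scalar multiplications L[i][k] * L[j][k]
    over all index triples with k < j <= i.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    G = 0
    for j in range(n):
        s = A[j][j]
        for k in range(j):
            s -= L[j][k] * L[j][k]           # diagonal case: L_jk squared
            G += 1
        L[j][j] = sqrt(s)
        for i in range(j + 1, n):
            s = A[i][j]
            for k in range(j):
                s -= L[i][k] * L[j][k]       # off-diagonal case, i > j
                G += 1
            L[i][j] = s / L[j][j]
    return L, G

L, G = cholesky_classical([[4, 2, 8], [2, 10, 19], [8, 19, 77]])
```

Counting triples $k < j \le i$ gives $G = n(n-1)(n+1)/6$ for a dense $n \times n$ matrix, which is 4 for the $3 \times 3$ example and grows as $n^3/6$.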

Corollary 4.6. The bandwidth cost lower bound for classical (dense or sparse) Cholesky
factorization is $G/(8\sqrt{M}) - M$, where $G$ is the number of scalar multiplications performed.
In the special case of factoring a dense $n \times n$ matrix, this lower bound is $\Omega(n^3/\sqrt{M})$.

Proof. We need only verify that Cholesky factorization is a 3NL computation. We let $f_{ij}$ be
the (piecewise) function defined in Equation (4.3). Then we make the correspondences that
$L_{ij}$ is stored at location $a(i,j) = b(i,j) = c(i,j)$. Note that all three sets $S_a$, $S_b$, and $S_c$
overlap. Since $L$ is the output of the operation and must be written to slow/global memory,
the mappings $a$, $b$, and $c$ are all one-to-one into slow/global memory. Further, functions $f_{ij}$
(involving a summation of $g_{ijk}$ outputs) and $g_{ijk}$ (scalar multiplication) depend non-trivially
on their arguments. Thus, the computation is 3NL.

In the case of dense $n \times n$ Cholesky factorization, the number of scalar multiplications is
$G = \Theta(n^3)$. □

Note  that  Corollary  4.6  reproduces  the  result  from  the  reduction  argument  in  Section 
3.2.2  for  dense  factorization.  However,  this  corollary  is  a  strict  generalization,  as  it  also 
applies  to  sparse  factorizations.  For  algorithms  attaining  this  bound  in  the  dense  case,  see 
Sections 7.1.2 and 8.1.2. As in the case of LU (Section 4.1.2.2), Corollary 4.6 is general
enough  to  accommodate  incomplete  Cholesky  (IC)  factorizations  [127]. 

We  now  consider  Cholesky  factorization  on  a  particular  class  of  sparse  matrices  for  which 
computational  lower  bounds  are  known.  Since  these  computational  bounds  apply  to  G, 
Corollary  4.6  leads  directly  to  a  concrete  communication  lower  bound.  Hoffman,  Martin, 
and  Rose  [87]  and  George  [73]  prove  that  a  lower  bound  on  the  number  of  multiplications 
required to compute the sparse Cholesky factorization of an $n^2$-by-$n^2$ matrix representing a
5-point stencil on a 2D grid of $n^2$ nodes is $\Omega(n^3)$. This lower bound applies to any matrix
containing  the  structure  of  the  5-point  stencil.  This  yields: 

Corollary  4.7.  In  the  case  of  the  sparse  Cholesky  factorization  of  a  matrix  which  includes 
the  sparsity  structure  of  the  matrix  representing  a  5-point  stencil  on  a  two-dimensional  grid 
of $n^2$ nodes, the bandwidth cost lower bound is $\Omega(n^3/\sqrt{M})$.

George  [73]  shows  that  this  arithmetic  lower  bound  is  attainable  with  a  nested  dissection 
algorithm  in  the  case  of  the  5-point  stencil.  Gilbert  and  Tarjan  [74]  show  that  the  upper 
bound  also  applies  to  a  larger  class  of  structured  matrices,  including  matrices  associated 
with  planar  graphs.  Recently,  David,  Demmel,  Grigori,  and  Peyronnet  [79]  obtained  new 
algorithms  for  sparse  cases  of  Cholesky  decomposition  that  are  proven  to  be  communication 
optimal  using  this  lower  bound. 

4.1.2.4 Computing Eigenvectors from Schur Form

The Schur decomposition of a matrix $A$ is the decomposition $A = QTQ^T$, where $Q$ is
unitary and $T$ is upper triangular. Note that in the real-valued case, $Q$ is orthogonal and $T$
is quasi-triangular. The eigenvalues of a triangular matrix are given by the diagonal entries.
Assuming all the eigenvalues are distinct, we can solve the equation $TX = XD$ for the
upper triangular eigenvector matrix $X$, where $D$ is a diagonal matrix whose entries are the
diagonal of $T$. This implies that for $i < j$,

$$X_{ij} = \Big( T_{ij} X_{jj} + \sum_{k=i+1}^{j-1} T_{ik} X_{kj} \Big) / (T_{jj} - T_{ii}) \tag{4.4}$$

where $X_{jj}$ can be arbitrarily chosen for each $j$. Note that in the sparse case, the equations
may be evaluated for some subset of indices $(i,j)$ and the summations may be over some
subset of the indices $k$. After computing $X$, the eigenvectors of $A$ are given by $QX$.
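
Equation (4.4) is a back substitution performed independently for each column $j$. The following sketch evaluates it for a dense upper triangular $T$ with distinct diagonal entries, choosing the free normalization $X_{jj} = 1$; the function name `triangular_eigenvectors` and the decision to count the product $T_{ij} X_{jj}$ in `G` are assumptions of this example.

```python
def triangular_eigenvectors(T):
    """Solve T X = X D for upper triangular X following Equation (4.4).

    T is upper triangular with distinct diagonal entries. Returns (X, G), where
    G counts the scalar multiplications T[i][k] * X[k][j].
    """
    n = len(T)
    X = [[0.0] * n for _ in range(n)]
    G = 0
    for j in range(n):
        X[j][j] = 1.0                          # arbitrary nonzero normalization
        for i in range(j - 1, -1, -1):         # rows above the diagonal, bottom up
            s = T[i][j] * X[j][j]
            G += 1
            for k in range(i + 1, j):
                s += T[i][k] * X[k][j]         # uses rows k > i of column j
                G += 1
            X[i][j] = s / (T[j][j] - T[i][i])
    return X, G

T = [[1.0, 2.0, 3.0], [0.0, 4.0, 5.0], [0.0, 0.0, 6.0]]
X, G = triangular_eigenvectors(T)
```

Column $j$ of the returned $X$ is an eigenvector of $T$ for eigenvalue $T_{jj}$, which can be confirmed by checking $TX = XD$ entrywise.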

Corollary 4.8. The bandwidth cost lower bound for classically computing the eigenvectors
of a (dense or sparse) triangular matrix with distinct eigenvalues is $G/(8\sqrt{M}) - M$, where
$G$ is the number of scalar multiplications performed. In the special case of a dense triangular
$n \times n$ matrix, this lower bound is $\Omega(n^3/\sqrt{M})$.

Proof. We need only verify that the computation is 3NL. We let $f_{ij}$ be the function defined
in Equation (4.4). Then we make the correspondences that $T_{ij}$ is stored at location $a(i,j)$
and $X_{ij}$ is stored at location $b(i,j) = c(i,j)$. Since $T$ is the input and must be stored in
slow/global memory, and $X$ is the output of the operation and must be written to slow/global
memory, the mappings $a$, $b$, and $c$ are all one-to-one into slow/global memory. Further,
functions $f_{ij}$ (involving a summation of $g_{ijk}$ outputs) and $g_{ijk}$ (scalar multiplication) depend
non-trivially on their arguments. Thus, the computation is 3NL.

In the case of computing the eigenvectors of a dense triangular $n \times n$ matrix, the number
of scalar multiplications is $G = \Theta(n^3)$. □

4.1.2.5 Floyd-Warshall All-Pairs Shortest Paths

Theorem 4.2 applies to more general computations than strictly linear algebraic ones, where
$g_{ijk}$ are scalar multiplications and $f_{ij}$ are based on summations. We consider the
Floyd-Warshall method [68, 148] for computing the shortest paths between all pairs of vertices in
a graph. If we define $D_{ij}^{(k)}$ to be the shortest distance between vertex $i$ and vertex $j$ using
the first $k$ vertices, then executing the following computation for all $k$, $i$, and $j$ determines
in $D_{ij}^{(n)}$ the shortest path between vertex $i$ and vertex $j$ using the entire graph:

$$D_{ij}^{(k)} = \min\Big( D_{ij}^{(k-1)},\; D_{ik}^{(k-1)} + D_{kj}^{(k-1)} \Big). \tag{4.5}$$

Taking care to respect the data dependencies, the computation can be done in place, with
$D^{(0)}$ being the original adjacency graph and each $D^{(k)}$ overwriting $D^{(k-1)}$. The original
formulation of the algorithm consists of three nested loops with $k$ as the outermost loop
index, but there are many other orderings which maintain correctness.
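
The in-place formulation with $k$ outermost can be sketched as follows. This is the textbook loop order only, shown to make the roles of $g_{ijk}$ (one addition) and $f_{ij}$ (a running minimum) concrete; the function name `floyd_warshall` and the sample graph are assumptions of this example.

```python
from math import inf

def floyd_warshall(D):
    """In-place Floyd-Warshall following Equation (4.5).

    D is an n x n list of lists holding edge weights (inf where there is no edge,
    0 on the diagonal). D is overwritten with shortest-path distances, and the
    number of scalar additions performed is returned.
    """
    n = len(D)
    G = 0
    for k in range(n):                           # classical ordering: k outermost
        for i in range(n):
            for j in range(n):
                candidate = D[i][k] + D[k][j]    # g_ijk: one scalar addition
                G += 1
                if candidate < D[i][j]:          # f_ij: minimum over all k
                    D[i][j] = candidate
    return G

D = [[0, 3, inf], [inf, 0, 1], [2, inf, 0]]
G = floyd_warshall(D)
```

On a dense graph every triple $(i,j,k)$ contributes one addition, so $G = n^3$ (here 27).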

Corollary 4.9. The bandwidth cost lower bound for computing all-pairs shortest paths using
the Floyd-Warshall method is $G/(8\sqrt{M}) - M$, where $G$ is the number of scalar additions
performed. In the special case of computing all-pairs shortest paths on a dense graph with $n$
vertices, this lower bound is $\Omega(n^3/\sqrt{M})$.

Proof. We need only verify that the Floyd-Warshall method is a 3NL computation. We
make the correspondences that $D_{ij}^{(k)}$ is stored at location $a(i,j) = b(i,j) = c(i,j)$. Then we
let $g_{ijk}$ be the addition operation (adding values stored at $a(i,k)$ and $b(k,j)$) and $f_{ij}$ be the
minimum of outputs of $g_{ijk}$ over all $k$. Note that all three sets $S_a$, $S_b$, and $S_c$ overlap. Since
$D$ is the input and output of the operation and must be written to slow/global memory, the
mappings $a$, $b$, and $c$ are all one-to-one into slow/global memory. Further, functions $f_{ij}$ and
$g_{ijk}$ depend non-trivially on their arguments. Thus, the computation is 3NL.

In the case of a dense graph with $n$ vertices, the number of scalar additions is
$G = \Theta(n^3)$. □

This result is also claimed (without proof) as [118, Lemma 1]. The authors also provide
a  sequential  algorithm  attaining  the  bound.  For  a  parallel  algorithm  attaining  this  bound, 
see  [136]. 

4.2 Lower Bounds for Three-Nested-Loop Computation with Temporary Operands

4.2.1  Lower  Bound  Argument 

Many  linear  algebraic  computations  are  nearly  3NL  but  fail  to  satisfy  the  assumption  that 
a,  b  and  c  are  one-to-one  mappings  into  slow/global  memory.  In  this  section,  we  show  that, 
under  certain  assumptions,  we  can  still  prove  meaningful  lower  bounds.  We  will  first  consider 
two  examples  to  provide  intuition  for  the  proof  and  then  make  the  argument  rigorous. 

Consider computing the Frobenius norm of a product of matrices:
$\|A \cdot B\|_F = \sqrt{\sum_{ij} (A \cdot B)_{ij}^2}$.
If we define $f_{ij}$ as the square of the dot product of row $i$ of $A$ and column $j$ of $B$, the output of
$f_{ij}$ does not necessarily map to a location in slow memory because entries of the product $A \cdot B$
are only temporary values, not outputs of the computation (the norm is the only output).
However, in order to compute the norm correctly, every entry of $A \cdot B$ must be computed,
so  the  matrix  might  as  well  be  an  output  of  the  computation  (in  which  case  Theorem  4.2 
would  apply).  In  the  proof  of  Theorem  4.10  below,  we  show  using  a  technique  of  imposing 
writes  that  the  amount  of  communication  required  for  computations  with  temporary  values 
like  these  is  asymptotically  the  same  as  those  forced  to  output  the  temporary  values. 

While the example above illustrates temporary output operands of $f_{ij}$ functions, a
computation may also have temporary input operands to $g_{ijk}$ functions. For example, if we
want to compute the Frobenius norm of a product of matrices where the entries of the input
matrices are given by formulas (e.g., $A_{ij} = i^2 + j$), then the computation may require very
little communication since the entries can be recomputed on the fly as needed. However, if
we require that each temporary input operand be computed only once, and we map each
operand to a location in slow/global memory, then if the operand has already been computed
and does not reside in fast memory, it must be read from slow/global memory. The
assumption that each temporary operand be computed only once seems overly restrictive in the
case of matrix entries given by simple formulas of their indices, but for many other common
computations (see Section 4.2.2), the temporary values are more expensive to compute and
recomputation on the fly is more difficult.
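
The "recompute on the fly" scenario can be made concrete with a short sketch. Here the input entries are supplied as index formulas, and neither the inputs nor any entry of $A \cdot B$ is ever stored, so the computation deliberately violates the compute-only-once assumption. The function name `frobenius_norm_of_product` and the callables `a` and `b` are hypothetical names for this illustration.

```python
from math import sqrt

def frobenius_norm_of_product(a, b, n):
    """Frobenius norm of A*B for n x n matrices given by formulas a(i, k), b(k, j).

    Input entries are recomputed whenever needed, and each entry of the product
    exists only in the scalar accumulator `dot`; only the norm is an output.
    """
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = 0.0
            for k in range(n):
                dot += a(i, k) * b(k, j)     # g_ijk applied to temporary inputs
            total += dot * dot               # f_ij output is a temporary value
    return sqrt(total)

# A_ij = i^2 + j as in the text; B is taken to be the identity for easy checking.
norm = frobenius_norm_of_product(lambda i, k: i * i + k,
                                 lambda k, j: 1.0 if k == j else 0.0,
                                 2)
```

With $n = 2$ the matrix $A$ is $[[0, 1], [1, 2]]$, so the result is $\sqrt{6}$. The working set is a constant number of scalars regardless of $n$, which is why no meaningful communication lower bound can hold without the compute-once assumption.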

We now make the intuition given in the examples above more rigorous. First, we define a
temporary value as any value involved in a computation that is not an original input or final
output. In particular, a temporary value need not be mapped to a location in slow memory.
Next, we distinguish a particular set of temporary values: we define the temporary inputs
to $g_{ijk}$ functions and temporary outputs of $f_{ij}$ functions as temporary operands. While there
may be other temporary values involved in the computation (e.g., outputs of $g_{ijk}$ functions),
we do not consider them temporary operands.$^1$ A temporary input $a(i,k)$ may be an input
to multiple $g_{ijk}$ functions ($g_{ijk}$ and $g_{ij'k}$ for $j \ne j'$), but we consider it a single temporary
operand. There may also be multiple accumulators for one output of an $f_{ij}$ function, but we
consider only the final computed output as a temporary operand. In the case of computing
the Frobenius norm of a product of matrices whose entries are given by formulas, the number
of temporary operands is $3n^2$, corresponding to the entries of the input and output matrices.

We  now  state  the  result  more  formally: 

Theorem 4.10. Suppose a computation is 3NL except that some of its operands (i.e., inputs
to $g_{ijk}$ operations or outputs of $f_{ij}$ functions) are temporary and are not necessarily mapped
to slow/global memory. Then if the number of temporary operands is $t$, and if each (input
or output) temporary operand is computed exactly once, then the bandwidth cost lower bound
is given by

$$W \ge \frac{G}{8\sqrt{M}} - M - t$$

where $M$ is the size of the fast/local memory.

Proof. Let $C$ be such a computation, and let $C'$ be the same computation with the
exception that for each of the three types of operands (defined by mappings $a$, $b$, and $c$), every
temporary operand is mapped to a distinct location in slow/global memory and must be
written to that location by the end of the computation. This enforces that the mappings
$a$, $b$, and $c$ are all one-to-one into slow/global memory, and so $C'$ is a 3NL computation
and Theorem 4.2 applies. Consider an algorithm that correctly performs computation $C$.
Modify the algorithm by imposing writes: every time a temporary operand is computed, we
impose a write to the corresponding location in slow/global memory (a copy of the value
may remain in fast memory). After this modification, the algorithm will correctly perform
the computation $C'$. Thus, since every temporary operand is computed exactly once, the
bandwidth cost of the algorithm differs by at most $t$ words from an algorithm to which the
lower bound $W \ge G/(8\sqrt{M}) - M$ applies, and the result follows. □

$^1$We ignore these other temporary values because, as in the case of true 3NL computations, they typically
do not require any memory traffic.
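
The accounting in the proof can be written as a short chain of inequalities. In this sketch, $W$ denotes the bandwidth cost of the original algorithm and $W'$ the cost of the same algorithm after the writes are imposed; the symbol $W'$ is notation introduced only for this illustration.

```latex
\begin{align*}
W' &\le W + t
   && \text{(at most one imposed write per temporary operand)} \\
W' &\ge \frac{G}{8\sqrt{M}} - M
   && \text{(Theorem 4.2 applied to the modified, fully 3NL computation)} \\
\Longrightarrow\quad
W  &\ge W' - t \;\ge\; \frac{G}{8\sqrt{M}} - M - t .
\end{align*}
```

The first inequality is where the compute-exactly-once assumption enters: if a temporary operand could be recomputed many times, the number of imposed writes would no longer be bounded by $t$.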

4.2.2  Applications  of  the  Lower  Bound 

4.2.2.1 Solving the Normal Equations

In Section 4.1.2.1, we prove a communication lower bound for the Cholesky-QR computation
by applying Theorem 4.2 to the TRSM required to compute the orthogonal matrix $Q$.
In some cases, such as solving the normal equations $A^T A x = A^T b$ (by forming $A^T A$ and
performing a Cholesky decomposition), the Cholesky-QR computation does not form $Q$
explicitly. Here, we apply Theorem 4.10 to the computation $A^T A$, the output of which is a
temporary matrix.

Corollary 4.11. The bandwidth cost lower bound for solving the normal equations to find
the solution to a least squares problem with matrix $A$ is $G/(8\sqrt{M}) - M - t$, where $G$ is
the number of scalar multiplications involved in the computation of $A^T A$, $t$ is the number of
nonzeros in $A^T A$, and we assume the entries of $A^T A$ are computed only once. In the special
case of a dense $m \times n$ matrix $A$ with $m > n$, this lower bound is $\Omega(mn^2/\sqrt{M})$.

Proof. As argued in Section 4.1.2.1, computing $A^T A$ is a 3NL computation (we ignore the
Cholesky factorization and triangular solves in this proof). That is, $a(i,j) = b(j,i)$ and $f_{ij}$ is
the summation function defined for either the lower ($i > j$) or upper ($i < j$) triangle because
the output is symmetric. Since $A$ is an input to the normal equations, it must be stored in
slow memory. However, the output of $A^T A$ need not be stored in slow memory (its Cholesky
factor will be used to solve for the final output of the computation). Thus, the number of
temporary operands is the number of nonzeros in the output of $A^T A$, which are all outputs
of $f_{ij}$ functions. In the case of a dense $m \times n$ matrix $A$ with $m > n$, the output $A^T A$ is
$n \times n$. When $m, n > \sqrt{M}$, the $mn^2/\sqrt{M}$ term asymptotically dominates the (negative) $M$
and $n^2/2$ terms. □
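
To see how the three terms compare numerically in the dense case, the sketch below evaluates the bound of Corollary 4.11 under the assumption that only one triangle of $A^T A$ (including the diagonal) is formed, so that $t = n(n+1)/2$ and each entry costs $m$ multiplications. The function name `normal_equations_lower_bound` and the sample sizes are choices made for this example.

```python
from math import sqrt

def normal_equations_lower_bound(m, n, M):
    """Evaluate G/(8*sqrt(M)) - M - t for forming one triangle of A^T A (A dense, m x n)."""
    t = n * (n + 1) // 2       # temporary operands: stored triangle of A^T A
    G = m * t                  # m scalar multiplications per computed entry
    return G / (8 * sqrt(M)) - M - t

bound = normal_equations_lower_bound(4096, 1024, 4096)
```

For $m = 4096$, $n = 1024$, $M = 4096$, the leading term is 4,198,400 words, while the subtracted terms are $M = 4096$ and $t = 524{,}800$, leaving a bound of 3,669,504 words; the $t$ term is noticeable but does not change the order of magnitude.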

4.2.2.2 Incomplete Factorizations

In Section 4.1.2.2, we consider lower bounds for threshold-based incomplete LU factorizations.
Because Theorem 4.2 requires that the $f_{ij}$ functions depend non-trivially on their
arguments, we must ignore all scalar multiplications that lead to discarded outputs (due to
their values falling below the threshold). However, because the output values must be fully
computed before comparing them to the threshold value, we may be ignoring a significant
amount of the computation. Using Theorem 4.10, we can state another (possibly tighter)
lower bound which counts all the scalar multiplications performed by imposing reads and
writes on the discarded values.

Corollary 4.12. Consider a threshold-based incomplete LU or Cholesky factorization, and
let $t$ be the number of output values discarded due to thresholding. Assuming each
discarded value is computed exactly once, the bandwidth cost lower bound for the computation
is $G/(8\sqrt{M}) - M - t$, where $G$ is the number of scalar multiplications.

Proof. As argued in Section 4.1.2.2, sparse LU and Cholesky factorizations are 3NL
computations. In a threshold-based incomplete factorization, some output values are discarded
if they are smaller than a given threshold. These discarded values are the only temporary
operands in the computation, so the result follows from Theorem 4.10. □

4.2.2.3 $LDL^T$ Factorization

Independent of sparsity and pivot order, the classical Bunch-Kaufman $LDL^T$ factorization
[47], where $L$ is unit lower triangular and $D$ is block diagonal with $1 \times 1$ and $2 \times 2$ blocks,
is specified in the case of positive or negative definite matrices (i.e., all diagonal blocks of $D$
are $1 \times 1$) by

$$\begin{aligned}
D_{jj} &= A_{jj} - \sum_{k<j} L_{jk}^2 D_{kk} \\
L_{ij} &= \Big( A_{ij} - \sum_{k<j} L_{ik} \cdot (D_{kk} L_{jk}) \Big) / D_{jj} && \text{for } i > j.
\end{aligned} \tag{4.6}$$


In  the  sparse  case,  the  equations  may  be  evaluated  for  some  subset  of  indices  ( i,j )  and  the 
summations  may  be  over  some  subset  of  the  indices  k.  Equation  (4.6)  also  assumes  pivoting 
has  already  been  incorporated  in  the  interpretation  of  the  indices  i,  j  and  k. 

Note that in this case, the operand $(D_{kk} L_{jk})$ is a temporary operand. This complication
is overlooked in [28], where it is claimed that the argument for Cholesky also applies to
$LDL^T$ factorization. In the terminology of this chapter, it is assumed that $LDL^T$ is 3NL.
Here, we obtain the lower bound as a corollary of Theorem 4.10.
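
The role of the temporary operand can be seen in a direct evaluation of Equation (4.6). The sketch below handles only the case of all $1 \times 1$ blocks with no pivoting (all $D_{jj} \ne 0$), and forms each product $D_{kk} L_{jk}$ exactly once per column before reusing it across rows. The function name `ldlt_classical`, the local list `w`, and the convention that `t` counts only the off-diagonal temporaries formed by a multiplication are assumptions of this example.

```python
def ldlt_classical(A):
    """Unpivoted LDL^T with 1 x 1 blocks only, following Equation (4.6).

    Returns (L, D, G, t): unit lower triangular L, the diagonal of D as a list,
    the number G of scalar multiplications L[i][k] * w[k], and the number t of
    temporary operands w[k] = D[k] * L[j][k], each computed exactly once.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    G = t = 0
    for j in range(n):
        w = [D[k] * L[j][k] for k in range(j)]   # temporaries for column j
        t += j
        d = A[j][j]
        for k in range(j):
            d -= L[j][k] * w[k]                  # equals L_jk^2 * D_kk
            G += 1
        D[j] = d
        for i in range(j + 1, n):
            s = A[i][j]
            for k in range(j):
                s -= L[i][k] * w[k]              # g_ijk with a temporary input
                G += 1
            L[i][j] = s / D[j]
    return L, D, G, t

L, D, G, t = ldlt_classical([[4.0, 2.0], [2.0, 5.0]])
```

The list `w` is never an output of the factorization, which is exactly why the mapping for this operand need not be into slow/global memory.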

To specify the more general computation where $D$ includes both $1 \times 1$ and $2 \times 2$ blocks,
we define the matrix $W = DL^T$ and let $S$ be the set of rows/columns corresponding to a
$1 \times 1$ block of $D$. Then for $j \in S$, the computation of column $j$ of $L$ can be written similarly
to Equation (4.6):

$$L_{ij} = \Big( A_{ij} - \sum_{k<j} L_{ik} \cdot W_{kj} \Big) / D_{jj}, \tag{4.7}$$

for $i > j \in S$. For pairs of columns $j, j+1 \notin S$ corresponding to $2 \times 2$ blocks of $D$, we will
use colon notation to describe the computation for pairs of elements:

$$L_{i,j:j+1} = \Big( A_{i,j:j+1} - \sum_{k<j} L_{ik} \cdot W_{k,j:j+1} \Big) D_{j:j+1,j:j+1}^{-1}, \tag{4.8}$$

for $i > j + 1$.

Corollary 4.13. The bandwidth cost lower bound for classical (dense or sparse) $LDL^T$
factorization is $G/(8\sqrt{M}) - M - t$, where $G$ is the number of scalar multiplications performed
in computing $L$, $t$ is the number of nonzeros in the matrix $DL^T$, and we assume the entries
of $DL^T$ are computed only once. In the special case of factoring a dense $n \times n$ matrix, this
lower bound is $\Omega(n^3/\sqrt{M})$.

Proof. We first verify that $LDL^T$ is a 3NL computation, though with temporary operands.
For simplicity, we consider the case of all $1 \times 1$ blocks, but the analysis follows for the case of
$2 \times 2$ blocks using Equations (4.7) and (4.8). We let $f_{ij}$ be the function defined in Equation
(4.6) and $g_{ijk}$ be the scalar multiplication of $L_{ik}$ and $W_{kj}$. Then we make the correspondences
that $L_{ij}$ is stored at location $c(i,j) = a(i,j)$ and $W_{ij}$ is stored at location $b(i,j)$. Since $L$ is an
input stored in slow/global memory, the mappings $a$ and $c$ are one-to-one into slow/global
memory. Further, functions $f_{ij}$ (involving a summation of $g_{ijk}$ outputs) and $g_{ijk}$ (scalar
multiplication) depend non-trivially on their arguments. Thus, the computation is 3NL,
with the exception that the mapping $b$ is not necessarily into slow/global memory, and the
number of temporary operands is the number of nonzeros in the matrix $W = DL^T$.

In the case of dense $n \times n$ $LDL^T$ factorization, the number of scalar multiplications
required to compute $L$ is $G = \Theta(n^3)$. □

See  Sections  7.1.3  and  8.1.3  for  a  discussion  of  algorithms  for  this  computation.  No  known 
sequential  algorithm  attains  this  bound  for  all  matrix  dimensions. 

4.2.2.4 $LTL^T$ Factorization

Another symmetric indefinite factorization computes a lower triangular matrix L and a
symmetric tridiagonal matrix T such that A = LTL^T. Symmetric pivoting is required
for numerical stability. Parlett and Reid [119] developed an algorithm for computing this
factorization requiring approximately (2/3)n^3 flops, the same cost as LU factorization and
twice the computational cost of Cholesky. Aasen [1] improved the algorithm and reduced the
computational cost to (1/3)n^3, making use of a temporary upper Hessenberg matrix H =
TL^T. Aasen's algorithm works by alternately solving for unknown values in the equations
A = LH and H = TL^T. Because the matrix H is integral to the computation but is a
temporary matrix, we use Theorem 4.10 to obtain a communication lower bound.

In fact, the computation can be generalized to compute a symmetric band matrix T with
bandwidth b (i.e., b is the number of nonzero diagonals both below and above the main
diagonal of T), in which case the matrix H has b nonzero subdiagonals. For uniqueness,
the L matrix is set to have unit diagonal and the first b columns of L are set to the first b
columns of the identity matrix. Because there are multiple ways to compute T and H, we
specify a classical LTL^T computation in terms of the lower triangular matrix L:

$$L_{ij} = \frac{1}{H_{j,j-b}} \left( A_{i,j-b} - \sum_{k=b+1}^{j-1} L_{ik} H_{k,j-b} \right) \quad \text{for } b < j < i \le n. \qquad (4.9)$$

In  the  sparse  case,  the  equations  may  be  evaluated  for  some  subset  of  indices  (i,j)  and 
the  summations  may  be  over  some  subset  of  the  indices  k.  Equation  (4.9)  also  assumes 
pivoting  has  already  been  incorporated  in  the  interpretation  of  the  indices  i,  j,  and  k. 
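
To see where the form of Equation (4.9) comes from, the following short derivation may help. It is a sketch under the assumptions stated above: the dense case, A = LH with H having b nonzero subdiagonals, L unit lower triangular with its first b columns equal to those of the identity, and H_{j,j-b} nonzero.

```latex
\begin{align*}
A_{i,j-b} &= \sum_{k=1}^{n} L_{ik} H_{k,j-b}
           = \sum_{k=1}^{j} L_{ik} H_{k,j-b}
  && \text{since } H_{k,j-b} = 0 \text{ for } k > j \\
          &= \sum_{k=b+1}^{j-1} L_{ik} H_{k,j-b} + L_{ij} H_{j,j-b}
  && \text{since } L_{ik} = 0 \text{ for } k \le b < i \\
\Longrightarrow \quad
L_{ij} &= \frac{1}{H_{j,j-b}} \left( A_{i,j-b} - \sum_{k=b+1}^{j-1} L_{ik} H_{k,j-b} \right),
  && b < j < i \le n .
\end{align*}
```

Each product L_{ik} H_{k,j-b} in the sum is one of the scalar multiplications counted by G in the lower bound.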
Corollary 4.14. The bandwidth cost lower bound for classical (dense or sparse) LTL^T
factorization is G/(8√M) − M − t, where G is the number of scalar multiplications performed
in computing L, t is the number of nonzeros in the matrix TL^T, and we assume the entries
of TL^T are computed only once. In the special case of factoring a dense n × n matrix (with
T having bandwidth b ≪ n), this lower bound is Ω(n^3/√M).

Proof. We first verify that LTL^T is a 3NL computation, though with temporary operands.
We let f_ij be the function defined in Equation (4.9) and g_ijk be the scalar multiplication of L_ik
and H_{k,j-b}. Then we make the correspondences that L_ij is stored at location c(i,j) = a(i,j)
and H_ij is stored at location b(i,j). Since L is an input stored in slow/global memory, the
mappings a and c are one-to-one into slow/global memory. Further, functions f_ij (involving
a summation of g_ijk outputs) and g_ijk (scalar multiplication) depend non-trivially on their
arguments. Thus, the computation is 3NL, with the exception that the mapping b is not
necessarily into slow/global memory, and the number of temporary operands is the number
of nonzeros in the matrix H = TL^T.

In the case of a dense n × n matrix, the number of scalar multiplications is G = n^3/6 +
O(n^2 b) and the number of nonzeros in H is n^2/2 + O(nb). When n ≫ √M, the n^3/√M term
asymptotically dominates the (negative) M and O(n^2) terms. □
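
As a rough numerical illustration of this last step, the sketch below evaluates G/(8√M) − M − t using only the leading terms G ≈ n^3/6 and t ≈ n^2/2 quoted above. Dropping the O(n^2 b) and O(nb) terms is an assumption that is reasonable only for b ≪ n. The function names are hypothetical, and clamping a negative value to zero is a convention of this sketch (a negative bound carries no information).

```python
import math

def bandwidth_lower_bound(G, M, t):
    """Evaluate G/(8*sqrt(M)) - M - t, clamped at zero (sketch convention)."""
    return max(0.0, G / (8.0 * math.sqrt(M)) - M - t)

def dense_ltlt_bound(n, M):
    """Bound for dense n-by-n LTL^T using leading-order counts only (assumes b << n)."""
    G = n**3 / 6.0   # leading term of the scalar multiplication count
    t = n**2 / 2.0   # leading term of the number of nonzeros in H
    return bandwidth_lower_bound(G, M, t)

# With M = 1024 words of fast memory (sqrt(M) = 32):
small = dense_ltlt_bound(500, 1024)     # n is only about 16*sqrt(M): the bound is vacuous
large = dense_ltlt_bound(10000, 1024)   # n >> sqrt(M): the n^3/sqrt(M) term dominates
```

The small case shows why the condition n ≫ √M matters: the n^3/(48√M) term exceeds n^2/2 only once n is larger than about 24√M.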

This  bound  is  attainable  in  the  sequential  case  by  the  algorithm  presented  in  Chapter  9 
(see  Section  9.3.2  for  details  of  the  communication  costs).  See  Sections  7.1.3  and  8.1.3  for 
more  discussions  of  symmetric-indefinite  algorithms. 

4.2.2.5 Gram-Schmidt Orthogonalization

Here we consider the Gram-Schmidt process (both classical and modified versions) for
orthogonalization of a set of vectors (see [57, Algorithm 3.1], for example). Given a set of
vectors stored as columns of an m × n matrix A, the Gram-Schmidt process computes a
QR decomposition, though the R matrix is sometimes not considered part of the output.
For generality, we assume the R matrix is not a final output and use Theorem 4.10.
Letting the columns of Q be the computed orthonormal basis, we specify the Gram-Schmidt
computation in terms of the equation for computing entries of R. In the case of Classical
Gram-Schmidt, we have

$$R_{ij} = \sum_{k=1}^{m} Q_{ki} A_{kj} \qquad (4.10)$$

and  in  the  case  of  Modified  Gram-Schmidt,  we  have 

$$R_{ij} = \sum_{k=1}^{m} Q_{ki} Q_{kj}, \qquad (4.11)$$

where Q_kj is the partially computed value of the jth orthonormal vector. In the sparse case,
the  equations  may  be  evaluated  for  some  subset  of  indices  (i,j)  and  the  summations  may  be 
over  some  subset  of  the  indices  k. 
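
The difference between Equations (4.10) and (4.11) can be made concrete with a minimal dense sketch in plain Python. The function name `gram_schmidt` and the list-of-rows matrix layout are assumptions of this example. The normalization step that fills the diagonal of R is the standard one and is not spelled out in the two equations above.

```python
import math

def gram_schmidt(A, modified=False):
    """Orthogonalize the columns of A (a list of m rows); return (Q, R)."""
    m, n = len(A), len(A[0])
    Q = [[float(x) for x in row] for row in A]   # column j is overwritten by q_j
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            if modified:
                # Equation (4.11): use the partially orthogonalized column j of Q
                R[i][j] = sum(Q[k][i] * Q[k][j] for k in range(m))
            else:
                # Equation (4.10): use the original column j of A
                R[i][j] = sum(Q[k][i] * A[k][j] for k in range(m))
            for k in range(m):
                Q[k][j] -= R[i][j] * Q[k][i]
        # Standard normalization (assumes the columns are linearly independent)
        R[j][j] = math.sqrt(sum(Q[k][j] ** 2 for k in range(m)))
        for k in range(m):
            Q[k][j] /= R[j][j]
    return Q, R
```

In both variants each off-diagonal entry R_ij costs m scalar multiplications of the form counted by G, so the dense count is on the order of mn^2.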
Corollary 4.15. The bandwidth cost lower bound for (dense or sparse) QR factorization
using classical or modified Gram-Schmidt orthogonalization is G/(8√M) − M − t, where
G is the number of scalar multiplications performed in computing R, t is the number of
nonzeros in R, and we assume the entries of R are computed only once. In the special case
of orthogonalizing a dense m × n matrix with m > n, this lower bound is Ω(mn^2/√M).

Proof. We first verify that Gram-Schmidt orthogonalization is a 3NL computation, possibly
with temporary operands. For the classical version, we let f_ij be the function defined in
Equation (4.10) and g_ijk be the scalar multiplication of Q_ki and A_kj. For the modified version,
we let f_ij be the function defined in Equation (4.11) and g_ijk be the scalar multiplication
of Q_ki and Q_kj. Then we make the correspondences that Q_ij is stored at location a(i,j),
and either A_ij is stored at location b(i,j) (in the classical version) or a = b (in the modified
version). Further, we let R_ij be stored at location c(i,j). Since A is an input and Q
is an output, the mappings a and b are one-to-one into slow/global memory. If R is an
output of the computation, then c is also one-to-one into slow/global memory; otherwise,
we impose reads and writes on the temporary entries of R. Further, functions f_ij (involving
a summation of g_ijk outputs) and g_ijk (scalar multiplication) depend non-trivially on their
arguments. Thus, the computation is 3NL, with the exception that the mapping c is not
necessarily into slow/global memory, and the number of temporary operands is the number
of nonzeros in R.

In the case of a dense m × n matrix, the number of scalar multiplications is G = Θ(mn^2)
and the number of nonzeros in R is about n^2/2. When m, n ≫ √M, the mn^2/√M term
asymptotically dominates the (negative) M and n^2 terms. □

4.3  Applying  Orthogonal  Transformations 

The most stable and highest performing algorithms for QR decomposition are based on
applying orthogonal transformations. Two-sided orthogonal transformations are used in the
reduction step of the most commonly used approaches for solving eigenvalue and singular
value problems: transforming a matrix to Hessenberg form for the nonsymmetric eigenproblem,
tridiagonal form for the symmetric eigenproblem, and bidiagonal form for the SVD. In
this section, we make two lower bound arguments in Sections 4.3.1 and 4.3.2, first in the
context of one-sided orthogonal transformations (as in QR decomposition), and then in
Section 4.3.3 we describe how to generalize the arguments for two-sided transformations. Each
of the lower bound arguments requires certain assumptions on the algorithms. We discuss
in Section 4.3.4 to which algorithms each of the lower bounds applies.

The case of applying orthogonal transformations is more subtle to analyze for several
reasons: (1) there is more than one way to represent the orthogonal factor (e.g., Householder
reflections and Givens rotations), (2) the standard ways to reorganize or “block”
transformations to reduce communication involve using the distributive law, not just summing terms
in a different order [42, 122, 131], and (3) there may be many temporary operands that are
not mapped to slow/global memory.
To be concrete, we consider Householder transformations, in which an elementary real
orthogonal matrix Q_1 is represented as Q_1 = I − τ_1 u_1 u_1^T, where u_1 is a column vector called
a Householder vector and τ_1 = 2/‖u_1‖_2^2. When applied from the left, a single Householder
reflection Q_1 is chosen so that multiplying Q_1 · A annihilates selected rows in a particular
column of A, and modifies one other row in the same column (accumulating the weight of
the annihilated entries). We consider the Householder vector u_1 itself to be the output of
the computation, rather than the explicit Q_1 matrix. Note that the Householder vector is
nonzero only in the rows corresponding to annihilated entries and the accumulator entry.
We furthermore model the standard way of blocking Householder vectors, writing

$$Q_\ell \cdots Q_1 = I - U_\ell T_\ell U_\ell^T,$$

where U_ℓ = [u_1, u_2, ..., u_ℓ] is n-by-ℓ and T_ℓ is ℓ-by-ℓ. We specify the application (from
the left) of blocked Householder transformations to a matrix A by inserting parentheses as
follows:

$$(I - U_\ell \cdot T_\ell \cdot U_\ell^T) \cdot A = A - U_\ell \cdot (T_\ell \cdot U_\ell^T \cdot A) = A - U_\ell \cdot Z_\ell,$$

defining Z_ℓ = T_ℓ · U_ℓ^T · A. We also overwrite A with the output: A := A − U_ℓ · Z_ℓ.
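
The blocked form can be checked numerically on a small example. The sketch below is illustrative only: the text does not say how T_ℓ is formed, so the row-by-row recurrence used here (each new row of the lower triangular T is −τ (u^T U_prev) T_prev, with τ on the diagonal) is an assumption of this example, as are the helper names. It applies ℓ reflections one at a time, then applies the same reflections in blocked form as A − U (T U^T A), and returns both results for comparison.

```python
import math

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def householder(x):
    """Return (u, tau) such that (I - tau*u*u^T) x is a multiple of e_1."""
    u = [float(v) for v in x]
    u[0] += math.copysign(math.sqrt(sum(v * v for v in x)), u[0])
    tau = 2.0 / sum(v * v for v in u)
    return u, tau

def blocked_vs_sequential(A, ell):
    """Return (B, W): blocked result A - U*Z and the one-at-a-time result."""
    m, n = len(A), len(A[0])
    W = [[float(v) for v in row] for row in A]
    U = [[0.0] * ell for _ in range(m)]
    T = [[0.0] * ell for _ in range(ell)]
    for k in range(ell):
        u_tail, tau = householder([W[i][k] for i in range(k, m)])
        u = [0.0] * k + u_tail
        # Sequential update: W := W - u * z with z = tau * u^T W
        z = [tau * sum(u[i] * W[i][j] for i in range(m)) for j in range(n)]
        for i in range(m):
            for j in range(n):
                W[i][j] -= u[i] * z[j]
        # Extend T so that Q_k ... Q_1 = I - U T U^T still holds (assumed recurrence)
        utU = [sum(u[i] * U[i][p] for i in range(m)) for p in range(k)]
        for q in range(k):
            T[k][q] = -tau * sum(utU[p] * T[p][q] for p in range(k))
        T[k][k] = tau
        for i in range(m):
            U[i][k] = u[i]
    Z = matmul(T, matmul(transpose(U), A))   # Z = T * (U^T * A), an ell-by-n temporary
    UZ = matmul(U, Z)
    B = [[A[i][j] - UZ[i][j] for j in range(n)] for i in range(m)]
    return B, W
```

For example, `blocked_vs_sequential([[4, 1, 2], [2, 3, 1], [1, 2, 5], [3, 1, 1]], 2)` returns two matrices that agree up to rounding. The matrix Z here is exactly the kind of temporary operand that the lower bound arguments below must account for.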

The application of one blocked transformation is a matrix multiplication (which is a
3NL computation though with some temporary operands), but in order to show an entire
computation (like QR decomposition) is 3NL, we need a global indexing scheme to define
the f_ij and g_ijk functions and a, b, and c mappings. To that end, we let k be the index of the
Householder vector, so that u_k is the kth Householder vector of the entire computation, and
we let U = [u_1, ..., u_h], where h is the total number of Householder vectors. We thus specify
the application of orthogonal transformations (from the left) to a matrix A as follows:

$$A_{ij} = A_{ij} - \sum_{k=1}^{h} U_{ik} Z_{kj}, \qquad (4.12)$$

where z_k (the kth row of Z) is a temporary quantity computed from A, u_k, and possibly
other columns of U, depending on how Householder vectors are blocked. If A is m × n, then
U is m × h and Z is h × n. Note that in the case of QR decomposition, it may be that h > n
(if one Householder vector is used to annihilate each entry below the diagonal, for example).
The equations may be evaluated for some subset of indices (i,j) and the summations are
over some subset of the indices k (even in the dense case). Equation (4.12) also assumes
pivoting has already been incorporated in the interpretation of the indices i and j.

4.3.1  First  Lower  Bound  Argument:  Applying  Theorem  4.10 

Given the specification of applying orthogonal updates as a 3NL computation (with
temporary operands), we can prove the following corollary of Theorem 4.10.

Corollary 4.16. The bandwidth cost lower bound for applying orthogonal updates as specified
in Equation (4.12) is G/(8√M) − M − t, where G is the number of scalar multiplications
involved in the computation of U · Z and t is the number of nonzeros in Z. In the special
case of computing a QR decomposition of an m × n matrix using only a constant number of
Householder vectors per column, this lower bound is Ω(mn^2/√M).

Proof. We first verify that applying orthogonal transformations is a 3NL computation,
though with temporary operands. We let f_ij be the function defined in Equation (4.12)
and g_ijk be the scalar multiplication of U_ik and Z_kj. Then we make the correspondences
that U_ij is stored at location a(i,j) and A_ij is stored at location c(i,j). Since A is an input
and U is an output, the mappings a and c are one-to-one into slow/global memory. If we
let Z_ij be stored at location b(i,j), then b may not map into slow/global memory. Further,
functions f_ij (involving a summation of g_ijk outputs) and g_ijk (scalar multiplication) depend
non-trivially on their arguments. Thus, the computation is 3NL, with the exception that
the mapping b is not necessarily into slow/global memory, and the number of temporary
operands is the number of nonzeros in Z.

In the case of a dense m × n matrix, the number of scalar multiplications is G = Θ(mn^2).
If only a constant number of Householder vectors are used per column, then h = O(n) and
the number of nonzeros in Z is O(mh) = O(mn). When m, n ≫ √M, the mn^2/√M term
asymptotically dominates the (negative) M and mn terms. □

Note  that  in  the  sparse  case  where  only  one  Householder  vector  per  column  is  used, 
a  separate  argument  can  bound  the  number  of  nonzeros  in  Z  in  terms  of  the  number  of 
nonzeros  in  the  input  A  and  output  U  (see  [28,  Lemma  4.1]). 

Many (but not all) algorithms for QR decomposition use only one Householder
transformation per column. See Sections 7.1.5 and 8.1.5 for a discussion of such algorithms.
However, for algorithms that use many Householder transformations per column, the bound
in Corollary 4.16 can degenerate to zero because the number of temporary operands can be
larger than the bandwidth cost guarantees of the 3NL argument.
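
A back-of-the-envelope sketch of this degeneration follows. All counts in it are rough assumptions made for illustration, not figures from the text: G is taken to be mn^2 with constant 1, and t is estimated as h·n, i.e., every entry of the h-by-n matrix Z is treated as nonzero. The function name is hypothetical.

```python
import math

def qr_update_bound(m, n, M, vectors_per_column):
    """Evaluate G/(8*sqrt(M)) - M - t with assumed order-of-magnitude counts."""
    h = vectors_per_column * n   # total number of Householder vectors
    G = m * n * n                # assumed multiplication count (constant factor dropped)
    t = h * n                    # assumed estimate of the nonzeros in the h-by-n matrix Z
    return G / (8.0 * math.sqrt(M)) - M - t

# One vector per column: t is about n^2, far below G/(8*sqrt(M)) when m, n >> sqrt(M).
one_per_column = qr_update_bound(10000, 10000, 1024, 1)

# One vector per annihilated entry (h about m*n): t is about m*n^2, which exceeds
# G/(8*sqrt(M)) for any M >= 1, so the expression is negative and the bound is vacuous.
one_per_entry = qr_update_bound(10000, 10000, 1024, 10000)
```

Under these assumptions the first value is positive and the second is negative, which is the situation the second lower bound argument is designed to handle.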

4.3.2  Second  Lower  Bound  Argument:  Bounding  Z  Values 

In  order  to  devise  a  lower  bound  for  algorithms  that  create  many  Z  values,  we  reconsider  the 
segment-based  argument  of  Theorem  4.2.  The  assumption  we  are  breaking  in  that  argument 
is that one of the operands to each g_ijk function (the Z_kj entry) is a temporary value. In the
argument  of  Section  4.3.1,  we  ignore  how  Z  is  computed  and  simply  count  the  total  number 
of  nonzeros  of  Z .  However,  Z  is  defined  in  terms  of  U  and  A  (and  T,  which  itself  depends 
only  on  U),  and  so  entries  of  Z  must  be  computed  from  entries  of  U  and  A.  We  use  this 
property  to  bound  not  the  overall  number  of  Z  entries,  but  the  number  of  Z  entries  that 
are  available  in  any  given  segment  of  the  computation.  Because  a  Z  entry  is  computed  from 
entries  of  U  and  A  and  updates  an  entry  of  A,  in  order  for  a  Z  entry  to  be  computed,  used 
as  many  times  as  necessary,  and  then  discarded  (without  generating  any  memory  traffic),  all 
the  relevant  entries  of  U  and  A  must  be  resident  in  fast  memory.  We  use  this  observation 
to  bound  the  number  of  Z  entries  available  in  any  given  segment,  but  this  argument  will 
require  two  new  assumptions  and  some  extra  notation. 
As a short-hand, we will sometimes refer to a matrix entry as being treated as nonzero
(TAN)  if  the  algorithm  assumes  that  its  value  could  be  nonzero  in  deciding  whether  to  bother 
performing  gijk.  Thus  an  algorithm  for  dense  matrices  treats  all  entries  as  nonzero,  even  if 
the  input  matrix  is  sparse,  whereas  a  sparse  factorization  algorithm  would  not.  We  also 
introduce  some  notation: 

•  Let  U(k)  be  the  kth  column  of  U  (which  is  the  kth  Householder  vector).  We  will  use 
U(k)  and  U(:,k)  interchangeably  when  the  context  is  clear. 

• Let col_src_U(k) be the index of the column in which U(k) introduces zeros.

• Let rows_U(k) be the set of indices of rows TAN in U(k). Let row_dest_U(k) be
the index of the row in column col_src_U(k) in which nonzero values in that column
are accumulated by U(k), and let zero_rows_U(k) be rows_U(k) with row_dest_U(k)
omitted.

4.3.2.1 Assumptions

We  will  make  two  central  assumptions  in  this  case.  First,  we  assume  that  the  algorithm 
does not block Householder updates (i.e., all T matrices are 1 × 1). Second, we assume
the  algorithm  makes  “forward  progress”  which  we  define  below.  As  explained  later,  forward 
progress  is  a  natural  property  of  most  efficient  implementations,  precluding  certain  kinds 
of  redundant  work.  See  [27,  Appendix  A]  for  a  motivating  counterexample  that  breaks  the 
assumption  or  Section  10.1.4  for  a  discussion  of  a  useful  algorithm  that  breaks  the  assumption 
and  beats  the  lower  bound. 

The first assumption means that we are computing (I − τ_k · U(:,k) · (U(:,k))^T) · A,
where τ_k is scalar. This seems like a significant restriction, since blocked Householder
transformations are widely used in practice. We do not believe this assumption is necessary
for the communication lower bound to be valid, but it is necessary for our proof technique.
This assumption yields a partial order (PO) in which the Householder updates must be
applied to get the right answer. It is only a partial order because if, say, U(:,k) and U(:,k+1)
do not “overlap”, i.e., have no common rows that are TAN, then (I − τ_k · U(:,k) · (U(:,k))^T)
and (I − τ_{k+1} · U(:,k+1) · (U(:,k+1))^T) commute, and either one may be applied first
(indeed, they may be applied independently in parallel).

Definition 4.17 (Partial Order on Householder vectors (PO)). Suppose k_1 < k_2 and
rows_U(k_1) ∩ rows_U(k_2) ≠ ∅; then U(k_1) < U(k_2) in the partial order.

We note that this relation is transitive. That is, two Householder vectors U(k_1) and
U(k_2) are partially ordered if there exists U(k*) such that U(k_1) < U(k*) < U(k_2), even if
rows_U(k_1) ∩ rows_U(k_2) = ∅.
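
Definition 4.17 and its transitive extension can be sketched in a few lines. The representation below is an assumption of this example: `rows[k]` is the set rows_U(k) for the kth Householder vector in application order, and the function name is hypothetical.

```python
def partial_order(rows):
    """Return all pairs (k1, k2) with U(k1) < U(k2) in the PO, including transitive ones."""
    h = len(rows)
    # Direct relations: k1 < k2 and the TAN row sets overlap
    before = {(a, b) for a in range(h) for b in range(a + 1, h) if rows[a] & rows[b]}
    # Transitive closure (Warshall-style); relations only point from smaller to larger index
    for mid in range(h):
        for a in range(mid):
            for b in range(mid + 1, h):
                if (a, mid) in before and (mid, b) in before:
                    before.add((a, b))
    return before

# Vectors 0 and 1 touch disjoint rows, so they are unordered and may be applied in either order.
independent = partial_order([{0, 1}, {2, 3}, {1, 2}])

# Vectors 0 and 2 touch disjoint rows but are still ordered, through vector 1.
chained = partial_order([{0, 1}, {1, 2}, {2, 3}])
```

In the first call the result contains (0, 2) and (1, 2) but not (0, 1); in the second it contains (0, 2) even though rows_U(0) and rows_U(2) do not intersect.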

Our  second  assumption  is  that  the  algorithm  makes  forward  progress: 



Definition 4.18 (Forward Progress (FP)). We say an algorithm which applies orthogonal
transformations to zero out entries makes forward progress if the following two conditions
hold:

1. an element that was deliberately zeroed out by one transformation is never again zeroed
out or filled in by another transformation,

2. if

a) U(k_1), ..., U(k_b) < U(k) in PO,

b) col_src_U(k_1) = ··· = col_src_U(k_b) = c ≠ ĉ = col_src_U(k),

c) and no other U(k_i) satisfies U(k_i) < U(k) and col_src_U(k_i) = c,

then

$$\mathrm{rows\_}U(k) \subset \bigcup_{i=1}^{b} \mathrm{zero\_rows\_}U(k_i) \;\cup\; \{\text{rows of column } c \text{ that are TAZ}\}. \qquad (4.13)$$


The first condition holds for most efficient algorithms applying orthogonal
transformations (though not all; see Chapter 10). By “deliberately,” we mean the algorithm converted
a TAN entry into a TAZ entry with an orthogonal transformation. The introduction of a
zero due to accidental cancellation (such zero entries are still TAN) is not deliberate. We
note also that FP is not violated if an original TAZ entry of the matrix is filled in (so that
it is no longer TAZ); this is a common situation when doing sparse QR. It is easy to see
that the first condition is necessary to prove any nontrivial communication lower bound, since
without it an algorithm could “spin its wheels” by repeatedly filling in and zeroing out entries,
doing an arbitrary amount of arithmetic with no memory traffic at all.

The second condition holds for every correct algorithm for QR decomposition that does
not violate the first condition. This condition means any later Householder transformation
(U(k)) that depends on earlier Householder transformations (U(k_1), ..., U(k_b)) creating
zeroes in a common column c may operate only “within” the rows zeroed out by the earlier
Householder transformations. We motivate this assumption in [27, Appendix B] by showing
that if an algorithm violates the second condition, it can “get stuck.” This means that it
cannot achieve triangular form without filling in a deliberately created zero.


4.3.2.2 Roots and Destinations

In  order  to  reason  more  directly  about  temporary  operands,  we  categorize  in  more  detail  the 
data  available  in  fast  memory  for  3NL  computation.  To  this  end,  we  consider  each  input 
or output operand of the functions g_ijk that appears in fast memory during a segment of M
slow  memory  operations.  It  may  be  that  an  operand  appears  in  fast  memory  for  a  while, 
disappears,  and  reappears,  possibly  several  times.  For  each  period  of  continuous  existence 
of  an  operand  in  fast  memory,  we  label  its  Root  (how  it  came  to  be  in  fast  memory)  and  its 
Destination  (what  happens  when  it  disappears): 
• Root R1: The operand was already in fast memory at the beginning of the segment,
and/or read from slow memory. There are at most 2M such operands altogether,
because the fast memory has size M, and because a segment contains at most M reads
from slow memory.

•  Root  R2:  The  operand  is  computed  (created)  during  the  segment.  Without  more 
information,  there  is  no  bound  on  the  number  of  such  operands. 

• Destination D1: An operand is left in fast memory at the end of the segment (so that
it is available at the beginning of the next one), and/or written to slow memory. There
are at most 2M such operands altogether, again because the fast memory has size M,
and because a segment contains at most M writes to slow memory.

• Destination D2: An operand is neither left in fast memory nor written to slow memory,
but  simply  discarded.  Again,  without  more  information,  there  is  no  bound  on  the 
number  of  such  operands. 

We  may  correspondingly  label  each  period  of  continuous  existence  of  any  operand  in  fast 
memory during one segment by one of four possible labels Ri/Dj, indicating the Root and
Destination of the operand at the beginning and end of the period. Based on the above
description, the total number of operands of all types except R2/D2 is bounded by 4M (the
maximum number of R1 operands plus the number of D1 operands, an upper bound). The
R2/D2  operands,  those  created  during  the  segment  and  then  discarded  without  causing  any 
slow  memory  traffic,  cannot  be  bounded  without  further  information. 

4.3.2.3 Bounding Z Values

With the assumptions of Section 4.3.2.1, we begin the argument to bound from below the
number  of  memory  operations  required  to  apply  the  set  of  Householder  transformations. 
As  in  the  proof  of  Theorem  4.2,  we  will  focus  our  attention  on  an  arbitrary  segment  of 
computation  in  which  there  are  O(M)  non-R2/D2  entries  in  fast  memory.  Our  goal  will 
be  to  bound  the  number  of  multiplications  in  a  segment  involving  R2/D2  entries,  since  the 
number  of  remaining  multiplications  can  be  bounded  as  before.  From  here  on,  let  us  denote 
by  Z2(k,j)  the  element  Z(k,j)  if  it  is  R2/D2,  and  by  Zn(k,j )  if  it  is  non-R2/D2.  We  will 
further  focus  our  attention  within  the  segment  on  the  update  of  an  arbitrary  column  of  the 
matrix,  A(:,j). 

Each Z(k,j) in memory is associated with one Householder vector U(:,k) which will
update A(:,j). We will denote the associated Householder vector by U2(:,k) if Z(k,j) =
Z2(k,j) is R2/D2 and Un(:,k) if Z(k,j) = Zn(k,j) is non-R2/D2. With this notation, we
have  the  following  two  lemmas  which  make  it  easier  to  reason  about  what  happens  to  A(:,j) 
during  a  segment. 

Lemma 4.19. If Z2(k,j) is in memory during a segment, then U2(:,k) as well as the entries
A(rows_U(k), j) are in memory during the segment.
Proof. Since Z2(k,j) is discarded before the end of the segment and may not be re-computed
later, the entire A(:,j) = A(:,j) − U(:,k) · Z2(k,j) computation has to end within the segment.
Thus, all entries involved must be resident in memory. □

However, even if a Zn(k,j) is in memory during a segment, the Un(:,k) · Zn(k,j)
computation will possibly not be completed during the segment, and therefore the Un(:,k) vector
and corresponding entries of A(:,j) may not be completely represented in memory.

Lemma 4.20. If Z2(k_1,j) and Z2(k_2,j) are in memory during a segment, and U(k_1) <
U(k) < U(k_2) in the PO, then Z(k,j) must also be in memory during the segment.

Proof. This follows from our first assumption that all T matrices are 1 × 1 and the partial
order is imposed. Since U(k_1) < U(k), Z(k,j) cannot be fully computed before the segment.
Since U(k) < U(k_2), U(:,k) · Z(k,j) has to be performed in the segment too, at least enough
to carry the dependency, so Z(k,j) cannot be fully computed after the segment. That is,
if U(:,k) is Un(:,k), not all rows_U(k) rows of A(:,j) must be updated, but enough for
Z2(k_2,j) to be computed and U2(:,k_2) · Z2(k_2,j) to be applied correctly. Thus, Z(k,j) is
computed during the segment and therefore must exist in memory. Note that a partial sum
of (U(:,k))^T · A(:,j) may have been computed before the beginning of the segment and used
in the segment to compute Zn(k,j), but the final Zn(k,j) value cannot be computed until
the segment. □

Roughly speaking, our goal now is to bound the number of U2(:,k) · Z2(k,j)
multiplications by the number of multiplications in a different matrix multiplication Û · Ẑ where we
can bound the number of Û entries by the number of U entries in memory, and bound the
number of Ẑ entries by the number of A entries plus the number of Zn entries in memory,
which lets us use the geometric bound of Lemma 2.7.

Given a particular segment and column j, we construct Û by first partitioning the U2(:,k)
by their col_src_U(k) and then collapsing each partition into one column of Û. Likewise,
collapse Z(:,j) by partitioning its rows corresponding to the partitioned columns of U and
taking the union of TAN entries in each set of rows to be the TAN entries of the corresponding
row of Ẑ(:,j). More formally,

Definition 4.21 (Û and Ẑ). For a given segment of computation and column j of A, we set
Û(r,c) to be TAN if there exists a U2(:,k) in fast memory such that c = col_src_U(k) and
r ∈ rows_U(k). We set Ẑ(c,j) to be TAN if there exists a Z2(k,j) in fast memory such that
c = col_src_U(k).

We will “emulate” the computation A(:,j) = A(:,j) − Σ U2(:,k) · Z2(k,j) with the related
computation A(:,j) = A(:,j) − Σ Û(:,c) · Ẑ(c,j) in the following sense: we will show that
the number of multiplications done by U2(:,k) · Z2(k,j) is within a factor of 2 of the number
of multiplications done by Û(:,c) · Ẑ(c,j).
The following example illustrates this construction on a small matrix, where K_2 contains
three  indices  (i.e.,  there  are  three  Householder  vectors  that  were  computed  to  zero  entries 
in  the  second  column  of  A);  just  TAN  patterns  are  shown. 


U(:, K_2) = [TAN pattern of three columns]  ⇒  Û(:, 2) = [TAN pattern of one collapsed column]

Note that we do not care what the TAN values of Û and Ẑ are; this computation has
no hope of getting a correct result because the rank of Û · Ẑ is generally less than the rank
of the subset of U · Z it replaces. We emulate in this way only to count the memory traffic.
We establish the following results with this construction.

Lemma 4.22. Û(:,c) has at least half as many TAN entries, and at most as many TAN
entries, as the columns of U from which it is formed.

Proof. The sets zero_rows_U(k) for k in a partition (i.e., with the same col_src_U(k)) must be
disjoint by the forward progress assumption, and there are at least as many of these rows as
in all the corresponding row_dest_U(k), which could potentially all coincide. By Lemma 4.19,
we know that complete U2(:,k) are present (otherwise they could, for example, all be Givens
transformations with the same destination row, and if zero rows were not present, they would
all collapse into one row). And so since every entry of zero_rows_U(k) contributes to a TAN
entry of Û(:,c), and zero_rows_U(k) constitutes at least half of the TAN entries of U(k),
Û(:,c) has at least half as many TAN entries as the corresponding columns of U.

If all the U2(:,k) being collapsed have TAN entries in disjoint sets of rows, then Û(:,c)
will have as many entries TAN as all the U(:,k). □
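
The collapsing step of Definition 4.21 and the counting claim of Lemma 4.22 can be illustrated on the Givens-like worst case mentioned in the proof. The data layout is an assumption of this sketch: each Householder vector is given as a pair (col_src, rows), and the function name is hypothetical.

```python
def collapse_columns(vectors):
    """Map each col_src to the union of the TAN row sets of its vectors."""
    collapsed = {}
    for col_src, rows in vectors:
        collapsed.setdefault(col_src, set()).update(rows)
    return collapsed

# Three Givens-like vectors zero rows 4, 3, and 2 of column 2 in turn,
# all accumulating into the same destination row 1.
vectors = [(2, {1, 4}), (2, {1, 3}), (2, {1, 2})]
collapsed = collapse_columns(vectors)

tan_before = sum(len(rows) for _, rows in vectors)   # TAN entries of the original columns
tan_after = len(collapsed[2])                        # TAN entries of the collapsed column
```

Here the three destination entries coincide while the zeroed rows stay disjoint, so the collapsed column keeps four of the six original TAN entries, within the factor of 2 claimed by the lemma.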

Because each TAN entry of U2(:,k) contributes one scalar multiplication to
A(:,j) = A(:,j) − Σ U2(:,k) · Z2(k,j) and each TAN entry of Û(:,c) contributes one scalar
multiplication to A(:,j) = A(:,j) − Σ Û(:,c) · Ẑ(c,j), we have the following corollary.

Corollary 4.23. Û(:,c) · Ẑ(c,j) does at least half as many multiplications as all the
corresponding U2(:,k) · Z2(k,j).

In order to bound the number of Û · Ẑ multiplications in the segment, we must also bound
the number of Ẑ entries available.
Lemma 4.24. The number of TAN entries of Ẑ(:,j) is bounded by the number of A(:,j)
entries plus the number of Zn(:,j) entries resident in memory.

Proof. Our goal is to construct an injective mapping ℐ from the set of Ẑ(:,j) entries to
the union of the sets of A(:,j) and Zn(:,j) entries. Consider the set of Z(k,j) entries (both
R2/D2 and non-R2/D2) in memory as vertices in a graph G. Each vertex has a unique label
k (recall that j is fixed), and we also give each vertex two more non-unique labels: 2 or n
to denote whether the vertex is Z2(k,j) or Zn(k,j) and col_src_U(k) to denote the column
source of the corresponding Householder vector. A directed edge (k_1, k_2) exists in the graph
if U(:,k_1) < U(:,k_2) in the PO. Note that all the vertices labeled both 2 and c are Z2(k,j)
that lead to Ẑ(c,j) being TAN in Definition 4.21.

For  all  values  of  c  =  coLsrc  JJ (k)  appearing  as  labels  in  G,  in  order  of  which  node  labeled 
c  is  earliest  in  PO  (not  necessarily  unique),  find  a  (not  necessarily  unique)  node  k  with  label 
coLsrc J7(/c)  =  c,  that  has  no  successors  in  G  with  the  same  label  c.  If  this  node  is  also 
labeled  n,  then  we  let  X  map  Z{c,j )  to  Zn(k,j).  If  node  k  is  labeled  2,  then  we  let  X  map 
Z(c,j )  to  A(row_dest  J7(/c),  j).  By  Lemma  4.19,  this  entry  of  A  must  be  in  fast  memory. 

We  now  argue  that  this  mapping  X  is  injective.  The  mapping  into  the  set  of  Zn(k,  j) 
entries  is  injective  because  each  Z{c,j )  can  be  mapped  only  to  an  entry  with  column  source 
c.  Suppose  the  mapping  into  the  A(:,j)  entries  is  not  injective,  and  let  Z(c,j )  and  Z(c,j) 
be  the  entries  which  are  both  mapped  to  some  A(r,j).  Then  there  are  entries  Z2(k,j)  and 
Z2(k,j)  such  that  c  =  col  src  U(k),  c  =  col  src  U(k),  r  =  row_destJ7(fc)  =  row_dest -U(k), 
and  neither  k  nor  k  have  successors  in  G  with  the  same  column  source  label. 

Since  rows -U(k)  and  rows -U(k)  intersect,  they  must  be  ordered  with  respect  to  the  PO, 
so  suppose  U(k)  <  U(k).  Consider  the  second  condition  of  FP.  In  this  case,  premises  (2a) 
and  (2b)  hold,  but  the  conclusion  (4.13)  does  not.  Thus,  premise  (2c)  must  not  hold, 
so  there  exists  another  Householder  vector  U(k*)  such  that  c  =  col_srcJ7(/c*)  and  r  e 
zero_rowsJ7  (k*). 

Again,  because  their  nonzero  row  sets  intersect,  each  of  these  Householder  vectors  must 
be  partially  ordered.  By  the  first  condition  of  FP,  since  row.dest _£/(£;)  €  zero_rowsJ7(/C), 
we  have  U(k)  <  U(k*).  Also,  since  U(k*)  satisfies  (2a),  we  have  U(k*)  <  U(k).  Thus, 
U(k)  <  U(k*)  <  U[k),  and  by  Lemma  4.20,  Z(k*,j)  must  also  be  in  fast  memory  and 
therefore  in  G.  Since  Z{k*,j)  is  a  successor  of  Z{k,j )  in  G,  we  have  a  contradiction.  □ 

Theorem 4.25. An algorithm which applies orthogonal transformations to annihilate matrix
entries, does not compute T matrices of dimension 2 or greater for blocked updates, maintains
forward progress as in Definition 4.18, and performs G flops of the form U · Z (as defined
in Equation (4.12)), has a bandwidth cost of at least $\Omega(G/\sqrt{M}) - M$ words. In the special
case of a dense m-by-n matrix with m > n, this lower bound is $\Omega(mn^2/\sqrt{M})$.

Proof.  We  first  argue  that  the  number  of  A,  U,  and  Zn  entries  available  during  a  segment 
are  all  O(M). 

Every A(i, j) operand is destined either to be output (i.e., D1) or converted into a
Householder vector. Every A(i, j) operand is either read from memory (i.e., R1) or created
on the fly due to sparse fill-in. So the only possible R2/D2 operands from A are entries
which are filled in and then immediately become Householder vectors, and hence become R2
operands of U. We bound the number of these as follows.

All U operands are eventually output, as they compose Q. So there are no D2 operands
of U (recall that we may compute each result U(i, k) only once, so it cannot be discarded).
So all R2 operands U(i, k) are also D1, and so there are at most 2M of them (since at most
M can remain in fast memory, and at most M can be written to slow memory, by the end
of the segment). This also bounds the number of R2/D2 operands A(i, j), and so bounds
the total number of A(i, j) operands by 6M (the sum of 2M = maximum number of D1
operands plus 2M = maximum number of R1 operands plus 2M = maximum number of
R2/D2 operands).

The number of Zn entries available in a segment is bounded by 2M because by definition,
all entries are non-R2/D2.

From Lemma 4.22, the number of U entries available is O(M) because it is bounded by
the number of U2 entries which is in turn bounded by the number of U entries. From Lemma
4.24, the number of Z entries available is O(M) because it is bounded by the sum of the
number of entries of A and of Zn.

Thus, since the number of entries of each operand available in a segment is O(M),
by Lemma 2.7, the number of U · Z scalar multiplications is bounded by $O(M^{3/2})$. By
Corollary 4.23, the number of U · Z scalar multiplications within a segment is also bounded
by $O(M^{3/2})$.

Since there are O(M) Zn(k, j) operands in a segment, the Loomis-Whitney argument
bounds the number of multiplies involving such operands by $O(M^{3/2})$, so with the above
argument that bounds the number of multiplies involving R2/D2 Z(k, j) operands, the total
number of multiplies involving both R2/D2 and non-R2/D2 Z entries is $O(M^{3/2})$.

The rest of the proof is similar to before: a lower bound on the number of segments is
then $\lfloor G/O(M^{3/2}) \rfloor \ge G/O(M^{3/2}) - 1$, so a lower bound on the number of slow memory
accesses is $M \cdot \lfloor G/O(M^{3/2}) \rfloor \ge \Omega(G/M^{1/2}) - M$. For dense m-by-n matrices with m > n,
the conventional algorithm does $G = \Theta(mn^2)$ multiplies. □
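
The counting step that closes the proof can be illustrated numerically. The Python sketch below is only an illustration under stated assumptions: it introduces a hypothetical explicit constant `c` so that at most `c * M**1.5` multiplications fit in one segment (the proof shows only that this cap is $O(M^{3/2})$), and the function name `segment_lower_bound` is likewise an assumption rather than anything defined in the text.

```python
import math

def segment_lower_bound(G, M, c=1.0):
    """Evaluate M * floor(G / F), where F = c * M^(3/2) caps the number of
    U*Z multiplications per segment. The constant c is an assumed,
    illustrative value; the proof only establishes F = O(M^(3/2))."""
    per_segment_cap = c * M * math.sqrt(M)
    # Each complete segment performs exactly M slow-memory accesses.
    complete_segments = math.floor(G / per_segment_cap)
    return M * complete_segments
```

For instance, with G = 1000, M = 4, and c = 1, the cap is 8 multiplications per segment, so there are 125 complete segments and the bound evaluates to 500 words. This is consistent with the closed form $G/\sqrt{M} - M = 496$.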

See  Sections  7.1.5  and  8.1.5  for  algorithms  that  satisfy  the  assumptions  and  attain  this 
bound  for  dense  matrices. 

4.3.2.4 Counting Arithmetic Operations

It is natural to wonder whether the G operations in Theorem 4.25 (and Corollary 4.16)
capture a constant fraction of the arithmetic operations performed by the algorithm, which
would allow us to deduce that the lower bound is asymptotically as large as possible. The G
operations are just the multiplications in all the different applications of block Householder
transformations $A := A - U \cdot Z$, where $Z = T \cdot U^T \cdot A$. We argue that under a natural
“genericity assumption” this constitutes a large fraction of all the multiplications in the
algorithm (although this is not necessary for our lower bound to be valid). Suppose
$(U^T \cdot A)(k,j)$ is nonzero; the amount of work to compute this is at most proportional to the total
number of entries stored (and so treated as nonzeros) in column k of U. Since T is triangular
and nonsingular, this means Z(k, j) will be generically nonzero as well and will be multiplied
by column k of U and added to column j of A, which costs at least as much as computing
$(U^T \cdot A)(k,j)$. The cost of the rest of the computation (forming and multiplying by T and
computing the actual Householder vectors) is a lower order term in practice; the dimension
of T is generally chosen small enough by the algorithm to try to ensure this.

4.3.3  Generalizing  to  Eigenvalue  and  Singular  Value  Reductions 

Standard  algorithms  for  computing  eigenvalues  and  eigenvectors,  or  singular  values  and 
singular  vectors  (the  SVD),  start  by  applying  orthogonal  transformations  to  both  sides  of 
A  to  reduce  it  to  a  “condensed  form”  (Hessenberg,  tridiagonal  or  bidiagonal)  with  the  same 
eigenvalues  or  singular  values,  and  simply  related  eigenvectors  or  singular  vectors  [57].  We 
can  extend  our  argument  for  one-sided  orthogonal  transformations  to  these  computations. 
We  can  have  some  arbitrary  interleaving  of  (block)  Householder  transformations  applied  on 
the  left, 

$$A = (I - U_L \cdot T_L \cdot U_L^T) \cdot A = A - U_L \cdot (T_L \cdot U_L^T \cdot A) = A - U_L \cdot Z_L,$$

where we define $Z_L = T_L \cdot U_L^T \cdot A$, and the right,

$$A = A \cdot (I - U_R \cdot T_R \cdot U_R^T) = A - (A \cdot U_R \cdot T_R) \cdot U_R^T = A - Z_R \cdot U_R^T,$$

where we define $Z_R = A \cdot U_R \cdot T_R$. Combining these, we can index the computation by
Householder vector number similarly to Equation (4.12):

$$A(i,j) = A(i,j) - \sum_{k_L} U_L(i,k_L) \cdot Z_L(k_L,j) - \sum_{k_R} Z_R(i,k_R) \cdot U_R(j,k_R). \qquad (4.14)$$
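
The left and right updates above can be sketched in a few lines of Python. This is a minimal illustration on nested lists, not the thesis's implementation: the helper names (`matmul`, `transpose`, `sub`, `two_sided_update`) are assumptions, and the sketch fixes one particular interleaving (one left update followed by one right update), whereas the text allows any interleaving.

```python
def matmul(X, Y):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def two_sided_update(A, UL, TL, UR, TR):
    """Apply A <- A - UL*ZL, then A <- A - ZR*UR^T."""
    ZL = matmul(TL, matmul(transpose(UL), A))   # Z_L = T_L * U_L^T * A
    A1 = sub(A, matmul(UL, ZL))                 # left update
    ZR = matmul(matmul(A1, UR), TR)             # Z_R = A * U_R * T_R
    return sub(A1, matmul(ZR, transpose(UR)))   # right update
```

With A = [[1, 2], [3, 4]], U_L = [[1], [1]], T_L = [[1]], U_R = [[1], [0]], and T_R = [[2]], both factors are 1 × 1-T Householder reflectors, and the result agrees with forming $(I - U_L T_L U_L^T) \cdot A \cdot (I - U_R T_R U_R^T)$ explicitly.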

Of course there are lots of possible dependencies ignored here, much as we wrote down a
similar formula for one-sided transformations. At this point we can apply either of the two
lower bound arguments from before: we can either assume (1) the number of Householder
vectors is small so that the number of temporary $Z_L$ and $Z_R$ values is bounded, applying
Theorem 4.10, or (2) all T matrices are 1 × 1 and we make “forward progress,” using the
argument in the proof of Theorem 4.25. In case (1) we obtain a similar result to Corollary
4.16:

Corollary 4.26. The bandwidth cost lower bound for applying two-sided orthogonal updates
is $G/(8\sqrt{M}) - M - t$, where G is the number of scalar multiplications involved in the
computation of $U_L Z_L$ and $Z_R U_R^T$ (as specified in Equation (4.14)), and t is the number of nonzeros
in $Z_L$ and $Z_R$. In the special case of reducing an m × n matrix to bidiagonal form using
only a constant number of Householder vectors per row and column, this lower bound is
$\Omega(mn^2/\sqrt{M})$. In the special case of reducing an n × n matrix to tridiagonal or Hessenberg
form using only a constant number of Householder vectors per column, this lower bound is
$\Omega(n^3/\sqrt{M})$.

In case (2), the second lower bound argument requires a little more discussion to clarify
the definitions of the partial order (Definition 4.17) and forward progress (Definition 4.18).
There will be two partial orders, one for $U_L$ and one for $U_R$. In parts 1 and 2 of
Definition 4.18, we insist that no transformation (from left or right) fills in or re-zeroes out an entry
deliberately zeroed out by another transformation (left or right). This implies that there is
an ordering between left and right transformations, but we do not need to use this order for
our counting argument. We also insist that part 3 of Definition 4.18 holds independently for
the left and for the right transformations.

With these minor changes, we see that the lower bound argument of Section 4.3.1 applies
independently to $U_L \cdot Z_L$ and $Z_R \cdot U_R^T$. In particular, insisting that left (right) transformations
cannot fill in or re-zero out entries deliberately zeroed out by right (left) transformations
means that the number of arithmetic operations performed by the left and right
transformations can be bounded independently and added. This leads to the same lower bound on the
number of words moved as before (in a Big-Oh sense):

Theorem 4.27. An algorithm which applies two-sided orthogonal transformations to
annihilate matrix entries, does not compute T matrices of dimension 2 or greater for blocked
updates, maintains forward progress as in Definition 4.18, and performs G flops of the form
U · Z (as defined in Equation (4.14)), has a bandwidth cost of at least $\Omega(G/\sqrt{M}) - M$ words.
In the special case of reducing a dense m-by-n matrix to bidiagonal form with m > n, this
lower bound is $\Omega(mn^2/\sqrt{M})$. In the special case of reducing an n × n matrix to tridiagonal
or Hessenberg form, this lower bound is $\Omega(n^3/\sqrt{M})$.

4.3.4  Applicability  of  the  Lower  Bounds 

While  we  conjecture  that  all  classical  algorithms  for  applying  one-  or  two-sided  orthogonal 
transformations  are  subject  to  a  lower  bound  in  the  form  of  Theorem  4.2,  not  all  of  those 
algorithms  meet  the  assumptions  of  either  of  the  two  lower  bound  arguments  presented  in 
this  section.  However,  many  standard  and  efficient  algorithms  do  meet  the  criteria;  see 
Chapters  7  and  8  for  a  full  discussion  of  these  algorithms. 

For example, algorithms for QR decomposition that satisfy the assumptions of Corollary
4.16 include the blocked, right-looking algorithm (currently implemented in (Sca)LAPACK
[8, 44]) and the recursive algorithm of Elmroth and Gustavson [66]. The simplest version
of Communication-Avoiding QR (i.e., one that does not block transformations, see last
paragraph in Section 6.4 of [62]) satisfies the assumptions of Theorem 4.25. However, most
practical implementations of CAQR do block transformations to increase efficiency in other
levels of the memory hierarchy, and neither proof applies to these algorithms. The recursive
QR decomposition algorithm of Frens and Wise [69] is also communication efficient, but
again our proofs do not apply.

Further, Corollary 4.26 applies to the conventional blocked, right-looking algorithms in
LAPACK [8] and ScaLAPACK [44] for reduction to Hessenberg, tridiagonal and bidiagonal
forms. Our lower bound also applies to the first phase of the successive band reduction
algorithm of Bischof, Lang, and Sun [38, 43], namely reduction to band form, because this
satisfies our requirement of forward progress. However, the second phase of successive band
reduction does not satisfy our requirement of forward progress, because it involves bulge
chasing, which repeatedly creates nonzero entries outside the band and zeroes them out
again. But since the first phase does asymptotically more arithmetic than the second phase,
our lower bound based just on the first phase cannot be much improved (see Chapter 10 for
more discussion of these and other algorithms).

There  are  many  other  eigenvalue  computations  to  which  these  results  may  apply.  For 
example,  the  lower  bound  applies  to  reduction  of  a  matrix  pair  (A,  B)  to  upper  Hessenberg 
and  upper  triangular  form.  This  is  done  by  a  QR  decomposition  of  B,  applying  QT  to  A  from 
the  left,  and  then  reducing  A  to  upper  Hessenberg  form  while  keeping  B  in  upper  triangular 
form.  Assuming  one  set  of  assumptions  applies  to  the  algorithm  used  for  QR  decomposition, 
the  lower  bound  applies  to  the  first  two  stages  and  reducing  A  to  Hessenberg  form.  However, 
since  maintaining  triangular  form  of  B  in  the  last  stage  involves  filling  in  entries  of  B  and 
zeroing them out again, our argument does not directly apply. This computation is a fraction
of  the  total  work,  and  so  this  fact  would  not  change  the  lower  bound  in  an  asymptotic  sense. 


4.4  Attainability 

In this chapter, we obtain lower bounds for many different computations, covering most of classical
linear algebra. The greatest value in establishing these lower bounds is that they set a target
for algorithmic development. For a given computation, we can determine if the best
algorithms attain the lower bounds (and are asymptotically optimal), or we can identify a gap
between algorithms (upper bounds) and lower bounds. In the latter case, tighter theoretical
analysis may allow for higher lower bounds, or algorithmic innovation may lead to more efficient
approaches to solving the problem. Once a communication-optimal algorithm is identified,
we know that significant algorithmic improvements are no longer possible, and we can focus
our attention on tuning the implementations to particular hardware platforms to maximize
actual performance. We will discuss state-of-the-art algorithms and the attainability of the
lower bounds presented here in Chapters 7 and 8.


Chapter 5

Lower Bounds for Strassen’s Matrix Multiplication


While  the  approach  for  proving  communication  lower  bounds  discussed  in  Chapter  4  works 
for  many  algorithms  in  linear  algebra,  it  no  longer  applies  when  distributivity  is  used,  as  in  the 
case  of  Strassen’s  matrix  multiplication  algorithm.  In  this  case,  we  consider  the  computation 
directed  acyclic  graph  (CDAG)  of  an  algorithm.  While  our  approach  of  computation  graph 
analysis  is  similar  to  the  red-blue  pebble  game  of  Hong  and  Kung  [88],  we  connect  the 
communication  costs  of  an  algorithm  to  the  expansion  properties  of  its  computation  graph. 

The  expansion  of  a  graph  relates  the  number  of  vertices  in  a  subset  of  the  graph  to  its 
neighbors  in  the  complement;  see  Definition  2.8  for  a  more  rigorous  definition.  In  the  case 
of  a  computation  graph,  the  vertices  correspond  to  arithmetic  operations,  and  the  edges 
(particularly  the  ones  between  a  given  subset  of  vertices  and  its  complement)  correspond 
to  communication.  The  analysis  of  the  expansion  properties  of  Strassen’s  CDAG  relies  on 
the  recursive  property  of  the  algorithm  and  can  be  extended  to  other  recursive  algorithms. 
Indeed,  other  fast  algorithms  for  matrix  multiplication  are  recursive,  and  we  obtain  lower 
bound  results  for  many  of  those  algorithms  (see  Chapter  6). 

This  chapter  is  organized  as  follows.  In  Section  5.1  we  discuss  the  relationship  between 
the  expansion  of  an  algorithm’s  CDAG  and  its  communication  costs.  We  analyze  the  CDAG 
for  Strassen’s  algorithm  in  particular  in  Section  5.2,  and  in  Section  5.3  we  combine  these 
results  in  the  form  of  a  communication  lower  bound  for  Strassen’s  algorithm. 

The  main  contributions  of  this  chapter  are 

• introducing a new proof technique of relating communication lower bounds to a
computation’s dependency graph (CDAG) and

• using the technique to prove a communication lower bound for Strassen’s matrix
multiplication algorithm.

The results and proofs in this chapter appear in [25] (conference version), which was
awarded the Best Paper prize at the ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA) in 2011, and [26] (journal version), written with coauthors James
Demmel, Olga Holtz, and Oded Schwartz.


5.1  Relating  Edge  Expansion  to  Communication 

In  this  section  we  recall  the  notion  of  the  computation  graph  of  an  algorithm,  then  show 
how  a  partition  argument  connects  the  expansion  properties  of  the  computation  graph  to 
the  communication  requirements  of  the  algorithm.  The  partition  argument  is  the  same  as 
the  one  used  in  Chapter  4. 

5.1.1  Computation  Graph 

For a given algorithm, we let G = (V, E) be the directed acyclic graph corresponding to the
computation (CDAG), where there is a vertex for each input element and each arithmetic
operation (AO) performed. The graph G contains a directed edge (u, v) if the output
operand of the AO corresponding to u (or the input element corresponding to u) is an input
operand to the AO corresponding to v. The in-degree of any vertex of G is, therefore, at
most 2 (as the arithmetic operations are binary). The out-degree is, in general, unbounded
(i.e., it may be a function of |V|). We next show how an expansion analysis of this graph
can be used to obtain a communication lower bound for the corresponding algorithm.

5.1.2  Partition  Argument 

Let M be the size of the fast memory. Let O be any total ordering of the vertices that
respects the partial ordering of the CDAG G (this total ordering corresponds to the actual
order in which the computations are performed). Let P be any partition of V into segments
$S_1, S_2, \ldots$, so that a segment $S_i \in P$ is a subset of the vertices that are contiguous in the
total ordering O.

For each segment S, let $R_S$ and $W_S$ be the sets of read and write operands, respectively
(see Figure 5.1). Namely, $R_S$ is the set of vertices outside S that have an edge going into S,
and $W_S$ is the set of vertices in S that have an edge going outside S. Then the total number
of reads needed to perform the computation in S is at least $|R_S| - M$, as at most M of the
needed $R_S$ operands are already in fast memory when the execution of the segment starts.
Similarly, S causes at least $|W_S| - M$ actual write operations, as at most M of the operands
needed by other segments are left in the fast memory when the execution of the segment
ends. The total bandwidth cost is therefore bounded below by

$$W \ge \max_{P} \sum_{S \in P} \left( |R_S| + |W_S| - 2M \right). \qquad (5.1)$$

Figure 5.1: A segment of computation S and its corresponding read operands $R_S$ and write
operands $W_S$.
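
The read and write sets of a segment, and the sum inside Equation (5.1) for one particular partition, can be computed directly from an edge list. The following Python sketch is a minimal illustration: the function names and the choice of equal-size contiguous segments are assumptions, and evaluating a single partition gives one candidate value for the maximum in (5.1), not the maximum itself.

```python
def read_write_sets(edges, segment):
    """Return (R_S, W_S) for a segment S of a CDAG given as directed edges."""
    seg = set(segment)
    # R_S: vertices outside S with an edge going into S.
    reads = {u for (u, v) in edges if v in seg and u not in seg}
    # W_S: vertices in S with an edge going outside S.
    writes = {u for (u, v) in edges if u in seg and v not in seg}
    return reads, writes

def partition_bound(edges, order, seg_size, M):
    """Sum of |R_S| + |W_S| - 2M over contiguous segments of `order`."""
    total = 0
    for start in range(0, len(order), seg_size):
        reads, writes = read_write_sets(edges, order[start:start + seg_size])
        total += len(reads) + len(writes) - 2 * M
    return total

# CDAG of (a + b) * (c + d): four inputs, two additions, one multiplication.
EDGES = [("a", "s1"), ("b", "s1"), ("c", "s2"), ("d", "s2"),
         ("s1", "p"), ("s2", "p")]
ORDER = ["a", "b", "c", "d", "s1", "s2", "p"]
```

For the segment {s1, s2} in this small example, the read operands are the four inputs and the write operands are s1 and s2 themselves, since both feed the multiplication outside the segment.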


5.1.3  Edge  Expansion  and  Communication 

Recall  the  definition  of  edge  expansion  (Definition  2.8).  We  can  use  the  edge  expansion  of  a 
CDAG  to  relate  a  segment  S  to  its  read  and  write  operands: 

Claim 5.1. Consider a segment S and its read and write operands $R_S$ and $W_S$. If the graph
G (with edge directions ignored) containing S has edge expansion h(G), maximum degree d,
and at least 2|S| vertices, then we have $|R_S| + |W_S| \ge h(G) \cdot |S|$.

Proof. We have $|E(S, V \setminus S)| \ge h(G) \cdot d \cdot |S|$. Since $E(S, V \setminus S) = E(R_S, S) \uplus E(W_S, V \setminus S)$,
we have $|E(S, V \setminus S)| = |E(R_S, S)| + |E(W_S, V \setminus S)| \le d \cdot |R_S| + d \cdot |W_S|$, where the last
inequality is by the degree bound. The claim follows. □
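
For very small graphs the edge expansion can be computed by brute force, which makes the quantities in Claim 5.1 tangible. The Python sketch below assumes that Definition 2.8 has the form $h(G) = \min_{|U| \le |V|/2} |E(U, V \setminus U)| / (d \cdot |U|)$, which is the form the proof above relies on. The function names are illustrative, and the enumeration is exponential in |V|, so it is only meant for toy examples.

```python
from itertools import combinations

def boundary_size(edges, subset):
    """Number of edges with exactly one endpoint in `subset` (directions ignored)."""
    s = set(subset)
    return sum(1 for (u, v) in edges if (u in s) != (v in s))

def edge_expansion(vertices, edges, d):
    """Brute-force h(G) = min over nonempty U with |U| <= |V|/2 of
    |E(U, V \\ U)| / (d * |U|); this form of Definition 2.8 is an assumption."""
    best = None
    for size in range(1, len(vertices) // 2 + 1):
        for subset in combinations(vertices, size):
            ratio = boundary_size(edges, subset) / (d * size)
            if best is None or ratio < best:
                best = ratio
    return best

# A 6-cycle: every vertex has degree d = 2.
CYCLE_VERTICES = list(range(6))
CYCLE_EDGES = [(i, (i + 1) % 6) for i in range(6)]
```

On the 6-cycle the minimum is attained by three consecutive vertices, which have two boundary edges, giving h(G) = 2 / (2 · 3) = 1/3.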

Combining Claim 5.1 with Equation (5.1) and choosing to partition V into |V|/s segments
of equal size s, we obtain:

$$W \ge \max_{s \le |V|/2} \frac{|V|}{s} \cdot \left( h(G) \cdot s - 2M \right) = \Omega\left( |V| \cdot h(G) \right),$$

for sufficiently large |V|. In many cases h(G) is too small to attain the desired bandwidth
cost lower bound. Typically, h(G) is a decreasing function in |V(G)| (i.e., the edge expansion
deteriorates as the input size and number of arithmetic operations increase). This is the case
with Strassen’s matrix multiplication algorithm. In such cases, it is better to consider the
expansion of G on small sets (see Equation (2.2)):

$$W \ge \max_{s \le |V|/2} \frac{|V|}{s} \cdot \left( h_s(G) \cdot s - 2M \right).$$

Choosing the minimal s so that

$$h_s(G) \cdot s \ge 3M \qquad (5.2)$$

we obtain

$$W \ge \frac{|V|}{s} \cdot M. \qquad (5.3)$$

The existence of a value $s \le |V|/2$ that satisfies condition (5.2) is not always guaranteed. In the
next section we confirm the existence of such s for Strassen’s CDAG, for sufficiently large
|V|. Indeed this is the interesting case, as otherwise all computations can be performed
inside the fast memory, with no communication, except for reading the input once.
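
The choice of s in condition (5.2) and the resulting bound (5.3) can be sketched as a simple search. In the Python sketch below, `hs` is a hypothetical callable standing in for the small-set expansion $h_s(G)$, which the text does not give in closed form here. Both function names are assumptions, and a linear scan is used only for clarity.

```python
def minimal_segment_size(hs, M, num_vertices):
    """Smallest s <= |V|/2 with hs(s) * s >= 3*M (condition (5.2)),
    or None if no such s exists."""
    for s in range(1, num_vertices // 2 + 1):
        if hs(s) * s >= 3 * M:
            return s
    return None

def bandwidth_lower_bound(hs, M, num_vertices):
    """Evaluate (|V| / s) * M from Equation (5.3) for the minimal s,
    or None when condition (5.2) cannot be met."""
    s = minimal_segment_size(hs, M, num_vertices)
    if s is None:
        return None
    return num_vertices / s * M
```

As a purely illustrative input, a constant expansion of 0.25 with M = 2 and |V| = 1000 gives s = 24. With |V| = 40 no admissible s exists, which corresponds to the case noted above where (5.2) is not guaranteed to be satisfiable.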

In  some  cases,  the  computation  graph  G  does  not  fit  these  assumptions:  it  may  not  be 
regular,  it  may  have  vertices  of  unbounded  degree,  or  its  edge  expansion  may  be  hard  to 
analyze.  In  such  cases,  we  may  consider  some  subgraph  G'  of  G  instead  to  obtain  a  lower 
bound  on  the  bandwidth  cost: 


Claim 5.2. Let G = (V, E) be a computation graph of an algorithm Alg, and let G' = (V', E')
be a subgraph of G, i.e., $V' \subseteq V$ and $E' \subseteq E$. If G' is d-regular and $\alpha = |V'|/|V|$, then, for
sufficiently large |V'|, the bandwidth cost of Alg is

$$W \ge \frac{\alpha}{2} \cdot \frac{|V|}{s} \cdot M$$

where s is chosen so that $h_s(G') \cdot \alpha s \ge 3M$.

Proof. The correctness of this claim follows from Equations (5.2) and (5.3), and from the
fact that at least an α/2 fraction of the segments have at least α · s of their vertices in G'
(otherwise $|V'| < \frac{\alpha}{2} \cdot \frac{|V|}{s} \cdot s + \left(1 - \frac{\alpha}{2}\right) \cdot \frac{|V|}{s} \cdot \frac{\alpha}{2} s < \alpha |V|$). If we partition G into segments of size s,
consider only those segments with at least α · s of their vertices in G'. The expansion of G'
guarantees that there are at least $h_s(G') \cdot \alpha s \ge 3M$ edges in E' that connect vertices in the
segment to its complement. Since at least an α/2 fraction of the |V|/s segments are of this kind, the result follows. □


5.2  Expansion  Properties  of  Strassen’s  Algorithm 

Recall Strassen’s algorithm for matrix multiplication (see Section 2.4.1) and consider its
computation graph (see Figure 5.2). Let $H_i$ be the computation graph of Strassen’s algorithm
for recursion depth i, so that $H_{\lg n}$ corresponds to the computation for input matrices of size
n × n. Then $H_{\lg n}$ has the following structure:

• Encode A: generate weighted sums of elements of A (this corresponds to the left factors
of lines 5-11 of the algorithm).

• Similarly encode B (this corresponds to the right factors of lines 5-11 of the algorithm).

• Then multiply the encodings of A and B element-wise (this corresponds to line 2 of
the algorithm).

• Finally, decode C, by taking weighted sums of the products (this corresponds to lines
12-15 of the algorithm).

Figure 5.2: The computation graph of Strassen’s algorithm (see Section 2.4.1). Top left:
$Dec_1C$. Top right: $H_1$, with $Dec_1C$ on the top and $Enc_1A$ and $Enc_1B$ at the bottom.
Bottom left: $Dec_{\lg n}C$. Bottom right: $H_{\lg n}$.

We let $Enc_iA$, $Enc_iB$, and $Dec_iC$ be the subgraphs corresponding to the encoding and
decoding of $2^i \times 2^i$ matrices A, B, and C, respectively. Note that $Dec_1C$ is presented in
Figure 5.2, for simplicity, with vertices of in-degree larger than two (but constant). A vertex
of degree larger than two, in fact, represents a full binary (not necessarily balanced) tree.
Note that replacing these high in-degree vertices with trees changes the edge expansion of
the graph by a constant factor at most (as this graph is of constant size, and connected).
Moreover, there is no change in the number of input and output vertices.

5.2.1 Computation Graph for n-by-n Matrices

Assume without loss of generality that n is an integer power of 2. Denote by $Enc_{\lg n}A$ the
part of $H_{\lg n}$ that corresponds to the encoding of matrix A. Similarly, let $Enc_{\lg n}B$ and
$Dec_{\lg n}C$ correspond to the parts of $H_{\lg n}$ that compute the encoding of B and the decoding
of C, respectively.

5.2.1.1 Top-Down Construction

We next construct the computation graph $H_{i+1}$ by constructing $Dec_{i+1}C$ (from $Dec_iC$ and
$Dec_1C$) and similarly constructing $Enc_{i+1}A$ and $Enc_{i+1}B$, then composing the three parts
together.

• Replicate $Dec_1C$ $7^i$ times.

• Replicate $Dec_iC$ four times.

• Identify the $4 \cdot 7^i$ output vertices of the copies of $Dec_1C$ with the $4 \cdot 7^i$ input vertices
of the copies of $Dec_iC$:

  - Recall that each $Dec_1C$ has four output vertices.

  - The set of each first output vertex of the $7^i$ $Dec_1C$ graphs is identified with the
    set of $7^i$ input vertices of the first copy of $Dec_iC$.

  - The set of each second output vertex of the $7^i$ $Dec_1C$ graphs is identified with
    the set of $7^i$ input vertices of the second copy of $Dec_iC$, and so on.

  - We make sure that the jth input vertex of a copy of $Dec_iC$ is identified with an
    output vertex of the jth copy of $Dec_1C$.

• We similarly obtain $Enc_{i+1}A$ from $Enc_iA$ and $Enc_1A$ and also $Enc_{i+1}B$ from $Enc_iB$
and $Enc_1B$.

• For every i, $H_i$ is obtained by connecting edges from the jth output vertices of $Enc_iA$
and $Enc_iB$ to the jth input vertex of $Dec_iC$.

This  completes  the  construction.  Let  us  note  an  important  property  of  these  graphs. 

Claim 5.3. All vertices of $Dec_{\lg n}C$ are of degree at most 6.

Proof. The graph $Dec_1C$ has no vertices which are both input and output. As all out-degrees
are at most 4 and all in-degrees are at most 2, the claim follows. □

However, note that $Enc_1A$ and $Enc_1B$ do have vertices which are both input and output
(e.g., $A_{11}$), therefore $Enc_{\lg n}A$ and $Enc_{\lg n}B$ have vertices of out-degree $\Theta(\lg n)$. All
in-degrees are at most 2, as an arithmetic operation has at most two inputs. As $H_{\lg n}$ contains
vertices of large degrees, it is easier to consider $Dec_{\lg n}C$: it contains only vertices of constant
bounded degree, yet at least one third of the vertices of $H_{\lg n}$ are in it.

5.2.1.2 Combinatorial Estimation of the Expansion

Let $G_k = (V, E)$ be $Dec_kC$, and let $S \subseteq V$, $|S| \le |V|/2$. Let $l_i$ be the ith level of vertices of
$G_k$, so

$$4^k = |l_1| < |l_2| < \cdots < |l_i| = 4^{k-i+1} 7^{i-1} < \cdots < |l_{k+1}| = 7^k.$$

The  following  bounds  on  the  fraction  of  vertices  in  the  first  level  will  be  useful: 

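Claim 5.4. $\frac{3}{7} \cdot \left(\frac{4}{7}\right)^k \le \frac{|l_1|}{|V|} \le \left(\frac{4}{7}\right)^k$.

Proof. The claim follows from the following identities:

$$|V| = \sum_{i=1}^{k+1} |l_i| = |l_{k+1}| \cdot \sum_{i=0}^{k} \left(\frac{4}{7}\right)^i = |l_{k+1}| \cdot \left(1 - \left(\frac{4}{7}\right)^{k+1}\right) \cdot \frac{7}{3}, \qquad |l_1| = \left(\frac{4}{7}\right)^k \cdot |l_{k+1}|.$$

□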
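
The level sizes $|l_i| = 4^{k-i+1} 7^{i-1}$ stated above make the first-level fraction easy to tabulate exactly. The Python sketch below is a minimal check, with assumed function names, of the lower bound $|l_1| \ge \frac{3}{7} \cdot (\frac{4}{7})^k \cdot |V|$ that is used in the proof of Claim 5.6. It uses exact rational arithmetic to avoid rounding.

```python
from fractions import Fraction

def level_sizes(k):
    """Sizes |l_i| = 4^(k-i+1) * 7^(i-1) for i = 1, ..., k+1 in Dec_k C."""
    return [4 ** (k - i + 1) * 7 ** (i - 1) for i in range(1, k + 2)]

def first_level_fraction(k):
    """Exact value of |l_1| / |V|, where |V| is the sum of all level sizes."""
    sizes = level_sizes(k)
    return Fraction(sizes[0], sum(sizes))
```

For k = 2 the levels have 16, 28, and 49 vertices, so |V| = 93 = (7^3 − 4^3)/3 and the first level holds 16/93 of the vertices, which exceeds (3/7)·(4/7)^2 = 48/343.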

Let $S_i = S \cap l_i$. Let $\sigma = |S|/|V|$ be the fractional size of S and $\sigma_i = |S_i|/|l_i|$ be the fractional size
of S at level i. Due to averaging, there exist i and i' such that $\sigma_i \le \sigma \le \sigma_{i'}$.

Claim 5.5. Let $\delta_i = \sigma_{i+1} - \sigma_i$. Then $|E(S, V \setminus S) \cap E(l_i, l_{i+1})| \ge c_1 \cdot d \cdot |\delta_i| \cdot |l_i|$, where $c_1$ is
a constant which depends on $G_1$.


Proof. Let G' be a $G_1$ component connecting $l_i$ with $l_{i+1}$ (so it has four vertices in $l_i$ and
seven in $l_{i+1}$). G' has no edges in $E(S, V \setminus S)$ if all or none of its vertices are in S. Otherwise,
as G' is connected, it contributes at least one edge to $E(S, V \setminus S)$. The number of such $G_1$
components with all their vertices in S is at most $\min\left\{ \frac{\sigma_i |l_i|}{4}, \frac{\sigma_{i+1} |l_{i+1}|}{7} \right\} = \min\{\sigma_i, \sigma_{i+1}\} \cdot \frac{|l_i|}{4}$.

Similarly, the number of such $G_1$ components with none of their vertices in S is at most
$\min\{1 - \sigma_i, 1 - \sigma_{i+1}\} \cdot \frac{|l_i|}{4}$. Therefore, there are at least $|\sigma_i - \sigma_{i+1}| \cdot \frac{|l_i|}{4}$ components with
at least one vertex in S and one vertex that is not. The claim follows with $c_1 = \frac{1}{4d}$. □

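Claim 5.6 (Homogeneity between levels). If there exists $i$ so that $|\sigma - \sigma_i|/\sigma \ge \frac{1}{10}$, then

$$|E(S, V \setminus S)| \ge c_2 \cdot d \cdot |S| \cdot \left(\frac{4}{7}\right)^k,$$

where $c_2$ is a constant which depends on $G_1$.

Proof. Assume that there exists $j$ so that $|\sigma - \sigma_j|/\sigma \ge \frac{1}{10}$. By Claim 5.5, we have

$$\begin{aligned}
|E(S, V \setminus S)| &\ge \sum_{i \in [k]} |E(S, V \setminus S) \cap E(l_i, l_{i+1})| \\
&\ge \sum_{i \in [k]} c_1 \cdot d \cdot |\delta_i| \cdot |l_i| \\
&\ge c_1 \cdot d \cdot |l_1| \cdot \sum_{i \in [k]} |\delta_i| \\
&\ge c_1 \cdot d \cdot |l_1| \cdot \left( \max_{i \in [k+1]} \sigma_i - \min_{i \in [k+1]} \sigma_i \right).
\end{aligned}$$

By the initial assumption, there exists $j$ so that $|\sigma - \sigma_j|/\sigma \ge \frac{1}{10}$, therefore
$\max_i \sigma_i - \min_i \sigma_i \ge \frac{\sigma}{10}$, then

$$|E(S, V \setminus S)| \ge c_1 \cdot d \cdot |l_1| \cdot \frac{\sigma}{10}.$$

By Claim 5.4, $|l_1| \ge \frac{3}{7} \cdot \left(\frac{4}{7}\right)^k \cdot |V|$, so

$$|E(S, V \setminus S)| \ge c_1 \cdot d \cdot \frac{\sigma}{10} \cdot \frac{3}{7} \cdot \left(\frac{4}{7}\right)^k \cdot |V|,$$

and since $|S| = \sigma \cdot |V|$,

$$|E(S, V \setminus S)| \ge c_2 \cdot d \cdot |S| \cdot \left(\frac{4}{7}\right)^k$$

for any $c_2 \le \frac{1}{10} \cdot \frac{3}{7} \cdot c_1$. □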

Let $T_k$ correspond to the recursive construction of $G_k$ in the following way (see Figure
5.3): $T_k$ is a tree of height $k + 1$, where each internal node has four children. The root $r$
of $T_k$ corresponds to $l_{k+1}$ (the largest level of $G_k$). The four children of $r$ correspond to the
largest levels of the four graphs that one can obtain by removing the level of vertices $l_{k+1}$
from $G_k$. For every node $u$ of $T_k$, denote by $V_u$ the set of vertices in $G_k$ corresponding to
$u$, so if $u$ is at level $i$ of $T_k$ then $V_u \subseteq l_i$. One can think of $T_k$ as a quadtree partitioning
of matrix $C$ into blocks, where $V_u$ is the largest level of the decoding subgraph of the $C$
sub-block corresponding to $u$. Therefore $|V_r| = 7^k$ where $r$ is the root of $T_k$, $|V_u| = 7^{k-1}$ for
each node $u$ that is a child of $r$; in general we have $4^{i-1}$ tree nodes $u$ corresponding to a set of
size $|V_u| = 7^{k-i+1}$. Each leaf corresponds to a set of size 1.
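As a sanity check on the quadtree description, the sets $V_u$ should partition the vertex set of $G_k$. The sketch below assumes depth $i$ is counted from 1 at the root, so that depth $i$ holds $4^{i-1}$ nodes with $|V_u| = 7^{k-i+1}$ each; the function names are illustrative only.

```python
def tree_levels(k):
    """(depth, number of tree nodes, |V_u|) for depth 1 (root) to k+1 (leaves)."""
    return [(i, 4 ** (i - 1), 7 ** (k - i + 1)) for i in range(1, k + 2)]


def total_vertices(k):
    """Total number of vertices of G_k covered by the sets V_u."""
    return sum(count * size for _, count, size in tree_levels(k))


def graph_vertices(k):
    """|V| computed directly from the level sizes 4^(k-i+1) * 7^(i-1) of G_k."""
    return sum(4 ** (k - i + 1) * 7 ** (i - 1) for i in range(1, k + 2))
```

Both counts agree for every `k`, which is consistent with each tree depth covering exactly one level of $G_k$.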

For a tree node $u$, let us define $\rho_u = |S \cap V_u|/|V_u|$ to be the fraction of $S$ nodes in $V_u$,
and $\delta_u = |\rho_u - \rho_{p(u)}|$, where $p(u)$ is the parent of $u$ (for the root $r$ we let $p(r) = r$). We let
$t_i$ be the $i$th level of $T_k$, counting from the bottom, so $t_{k+1}$ is the root and $t_1$ are the leaves.

Claim 5.7. As $V_r = l_{k+1}$ we have $\rho_r = \sigma_{k+1}$. For a tree leaf $u \in t_1$, we have $|V_u| = 1$.
Therefore $\rho_u \in \{0, 1\}$. The number of vertices $u$ in $t_1$ with $\rho_u = 1$ is $\sigma_1 \cdot |l_1|$.

Figure 5.3: The graph $G_k$ and its corresponding tree $T_k$.


Claim 5.8. Let $u_0$ be an internal tree node, and let $u_1, u_2, u_3, u_4$ be its four children. Then

$$\sum_i |E(S, V \setminus S) \cap E(V_{u_i}, V_{u_0})| \ge c_1 \cdot d \cdot \sum_i |\rho_{u_i} - \rho_{u_0}| \cdot |V_{u_i}|,$$

where $c_1$ is a constant that depends on $G_1$.

Proof. The proof follows that of Claim 5.5. Let $G'$ be a $G_1$ component connecting $V_{u_0}$ with
$\bigcup_{i \in [4]} V_{u_i}$ (so it has seven vertices in $V_{u_0}$ and one in each of $V_{u_1}, V_{u_2}, V_{u_3}, V_{u_4}$). $G'$ has no
edges in $E(S, V \setminus S)$ if all or none of its vertices are in $S$. Otherwise, as $G'$ is connected,
it contributes at least one edge to $E(S, V \setminus S)$. The number of $G_1$ components with all
their vertices in $S$ is at most $\min\{\rho_{u_0}, \rho_{u_1}, \rho_{u_2}, \rho_{u_3}, \rho_{u_4}\} \cdot |V_{u_1}|$. Therefore, there are at least
$\max_{i \in [4]} \{|\rho_{u_0} - \rho_{u_i}|\} \cdot |V_{u_1}| \ge \frac{1}{4} \cdot \sum_{i \in [4]} |\rho_{u_i} - \rho_{u_0}| \cdot |V_{u_i}|$ $G_1$ components with at least one
vertex in $S$ and one vertex that is not. The claim follows with $c_1 = \frac{1}{4d}$. □

We  now  state  and  prove  our  main  lemma  on  the  edge  expansion  of  the  decoding  graph 
of  Strassen’s  CDAG: 

Lemma 5.9. The edge expansion of $Dec_kC$ is

$$h(Dec_kC) = \Omega\left(\left(\frac{4}{7}\right)^k\right).$$

Proof. Consider a subset $S$ of the vertices of the decoding graph. Recall that $G_k = Dec_kC$ is
a layered graph (with layers corresponding to recursion steps), so all edges (excluding loops)
connect between consecutive levels of vertices. By Claim 5.6, each level of $G_k$ contains about
the same fraction of $S$ vertices, or else we have many edges leaving $S$. By Claim 5.7, the
lowest level is composed of distinct parts that cannot have homogeneity (of the fraction of
their $S$ vertices), otherwise many edges leave $S$.

Let  Tk  and  Vu  be  defined  as  in  the  previous  section.  We  will  show  that  the  homogeneity 
between  levels,  combined  with  the  heterogeneity  of  the  lowest  level,  guarantees  that  there 
are  many  edges  leaving  S. 

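We have

$$|E(S, V \setminus S)| = \sum_{u \in T_k} |E(S, V \setminus S) \cap E(V_u, V_{p(u)})|.$$

By Claim 5.8, this implies

$$\begin{aligned}
|E(S, V \setminus S)| &\ge \sum_{u \in T_k} c_1 \cdot d \cdot |\rho_u - \rho_{p(u)}| \cdot |V_u| \\
&= c_1 \cdot d \cdot \sum_{i \in [k]} \sum_{u \in t_i} |\rho_u - \rho_{p(u)}| \cdot 7^{i-1} \\
&\ge c_1 \cdot d \cdot \sum_{i \in [k]} \sum_{u \in t_i} |\rho_u - \rho_{p(u)}| \cdot 4^{i-1}.
\end{aligned}$$

As each internal node has four children, this means

$$|E(S, V \setminus S)| \ge c_1 \cdot d \cdot \sum_{v \in t_1} \sum_{u \in v \sim r} |\rho_u - \rho_{p(u)}|,$$

where $v \sim r$ is the path from $v$ to the root $r$. By the triangle inequality for the function $|\cdot|$,

$$|E(S, V \setminus S)| \ge c_1 \cdot d \cdot \sum_{v \in t_1} |\rho_v - \rho_r|.$$

By Claim 5.7,

$$|E(S, V \setminus S)| \ge c_1 \cdot d \cdot |l_1| \cdot \left((1 - \sigma_1) \cdot \rho_r + \sigma_1 \cdot (1 - \rho_r)\right).$$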

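By Claim 5.6, w.l.o.g., $|\sigma_i - \sigma|/\sigma \le \frac{1}{10}$ (otherwise $|E(S, V \setminus S)| \ge c_2 \cdot d \cdot |S| \cdot \left(\frac{4}{7}\right)^k$), so
$\frac{9}{10}\sigma \le \sigma_i \le \frac{11}{10}\sigma$. As $\sigma \le \frac{1}{2}$ and $\rho_r = \sigma_{k+1}$,

$$|E(S, V \setminus S)| \ge \frac{3}{4} \cdot c_1 \cdot d \cdot |l_1| \cdot \sigma,$$

and by Claim 5.4,

$$|E(S, V \setminus S)| \ge c_3 \cdot d \cdot |S| \cdot \left(\frac{4}{7}\right)^k,$$

for any $c_3 \le \frac{3}{4} \cdot \frac{3}{7} \cdot c_1$.

Thus, since $d$ is constant (Claim 5.3), we have $h(Dec_kC) = \Omega\left(\left(\frac{4}{7}\right)^k\right)$, where the hidden
constant is $c_4 = d \cdot \min\{c_2, c_3\}$. □

5.3 Communication Lower Bounds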


Theorem 5.10. Consider Strassen’s algorithm implemented on a sequential machine with
fast memory of size $M$. Then for $M \le n^2$, and assuming no recomputation, the bandwidth
cost of Strassen’s algorithm is

$$W = \Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\lg 7} \cdot M\right).$$


Proof. Let $k = \lg \sqrt{M} + c_5$ where $c_5$ is a constant to be determined, and assume $k$ divides
$\lg n$ evenly. Note that it is sufficient to prove the result for an infinite number of $n$'s,
but the smallest $n$ for which the proof holds is $n = 2^{c_5} \sqrt{M}$ (so that $k = \lg n$). This
assumption implies that $Dec_{\lg n}C$ is composed of edge-disjoint copies of $Dec_kC$, and we can
apply Lemma 2.10 with $G = Dec_{\lg n}C$, $G' = Dec_kC$, and $s = |V(G')|/2$. Since $d$ and $d'$ are
the same, we have

$$h_s(Dec_{\lg n}C) \ge h(Dec_kC)$$

and by Lemma 5.9 this implies

$$h_s(Dec_{\lg n}C) \ge c_4 \cdot \left(\frac{4}{7}\right)^k.$$

Note that $7^k/2 \le s \le 2 \cdot 7^k$.

We now apply Claim 5.2 with $G$ as the entire CDAG of Strassen’s algorithm of matrix
dimension $n$ and $G' = Dec_{\lg n}C$. Here $\alpha = 1/3$ and

$$h_s(Dec_{\lg n}C) \cdot \alpha s \ge \frac{c_4}{6} \cdot 4^{c_5} \cdot M \ge 3M$$

for $c_5 \ge \lg \sqrt{18/c_4}$, so

$$W = \Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\lg 7} \cdot M\right).$$

The above inequality holds for $M \le c_6 \cdot n^2$, where $c_6 = c_4/18 < 1$. For $c_6 \cdot n^2 < M \le n^2$,
note that

$$W \ge n^2 = \Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\lg 7} \cdot M\right)$$

as one has to read $2n^2$ words of input data and at most $n^2$ of them can be in the fast memory
at the start of the computation. □


Theorem 5.10 holds for any implementation and any known variant of Strassen’s algorithm¹
that is based on performing 2 × 2 matrix multiplication with 7 scalar multiplications.
This includes Winograd’s $O(n^{\lg 7})$ variant (see Section 2.4.2) that uses 15 additions instead
of 18, which is the most used fast matrix multiplication algorithm in practice [64, 65, 105].
See Section 7.2 for a scheduling of the computation that attains this lower bound. Note that
Theorem 5.10 does not hold for values of $M$ which are so large that the entire problem can
fit into fast memory simultaneously. In the case that the input matrices start in fast memory
and the output matrix finishes in fast memory, no communication is necessary.

¹ This lower bound for the sequential case seems to contradict the upper bound from [70] and later [46],
due to a miscalculation in the former which is propagated in the latter ([104]).

For  parallel  algorithms,  we  have: 


Corollary 5.11. Consider Strassen’s algorithm, implemented on a parallel machine with $P$
processors, each with a local memory of size $M$. There exists a constant $c$ such that for
$M \le c \cdot n^2/P^{2/\lg 7}$, and assuming no recomputation, the bandwidth cost of Strassen’s algorithm
is

$$W = \Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\lg 7} \cdot \frac{M}{P}\right).$$

Proof. In the parallel case, we consider the busiest processor. Due to averaging, it must do
at least $(1/P)$th of the work. We apply the same partitioning argument as in the proof of
Theorem 5.10 to that processor’s subset of computation. However, in order for the proof to
work we must require $M \le c \cdot n^2/P^{2/\lg 7}$ for some constant $c$ (rather than $M \le n^2$ in the sequential
case). □


While  Corollary  5.11  does  not  hold  for  all  sizes  of  local  memory  (relative  to  the  problem 
size  and  number  of  processors),  there  exists  another  lower  bound  that  holds  for  all  local 
memory  sizes,  though  it  requires  separate  assumptions  (see  Section  6.2).  See  Chapter  11  for 
a  parallel  algorithm  that  attains  this  bound  where  possible. 


5.4  Conclusions 

As  we  explain  in  Chapter  7,  the  lower  bound  of  Theorem  5.10  (for  sequential  machines)  is 
attainable with the natural recursive algorithm. This has an important (and somewhat surprising)
implication. Re-writing the lower bound for classical square matrix multiplication
given by Theorem 3.1 as $W_{\text{classical}} = \Omega((n/\sqrt{M})^3 \cdot M)$, we can more easily compare it to Theorem 5.10.
Since $\lg 7 < 3$, we see that Strassen’s algorithm requires not only less computation
than  the  classical  algorithm,  it  also  requires  less  communication!  While  we  often  reduce 
communication  at  the  expense  of  extra  computation,  in  the  case  of  Strassen’s  algorithm,  we 
can  reduce  both  simultaneously.  In  the  parallel  case,  however,  exploiting  this  opportunity  is 
less  straightforward.  The  absence  of  a  parallel  algorithm  that  attained  the  lower  bound  of 
Corollary  5.11  sparked  our  interest  in  developing  a  new  and  more  communication-efficient 
algorithm.  We  present  this  new  algorithm  in  Chapter  11  and  show  that  it  attains  the  lower 
bound  (where  possible),  and  we  discuss  other  algorithms  that  use  Strassen’s  or  other  fast 
matrix  multiplication  algorithms  in  Chapters  7  and  8. 
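To make the comparison concrete, the two sequential bounds can be evaluated side by side. This is a rough sketch only: it assumes the bounds have the forms $(n/\sqrt{M})^3 \cdot M$ and $(n/\sqrt{M})^{\lg 7} \cdot M$ with all hidden constants set to 1, so the outputs indicate growth rates rather than actual word counts. The function names are illustrative.

```python
import math

LG7 = math.log2(7)


def classical_bound(n, M):
    """(n / sqrt(M))^3 * M, with the hidden constant taken to be 1."""
    return (n / math.sqrt(M)) ** 3 * M


def strassen_bound(n, M):
    """(n / sqrt(M))^(lg 7) * M, with the hidden constant taken to be 1."""
    return (n / math.sqrt(M)) ** LG7 * M


def bound_ratio(n, M):
    """Classical bound divided by Strassen bound: (n / sqrt(M))^(3 - lg 7)."""
    return classical_bound(n, M) / strassen_bound(n, M)
```

The ratio grows as $(n/\sqrt{M})^{3 - \lg 7}$, so the gap widens as the problem becomes large relative to the fast memory and vanishes when $M = n^2$.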
Chapter 6

Extensions  of  the  Lower  Bounds 


In  this  chapter  we  describe  several  extensions  of  the  analysis  and  results  presented  in  the 
previous  chapters.  The  main  contributions  here  are 

•  extending  the  lower  bound  for  Strassen’s  algorithm  to  other  fast  matrix  multiplication 
algorithms, 

•  proving  a  separate  set  of  memory-independent  lower  bounds  for  both  classical  and  fast 
parallel  algorithms  that  (1)  are  tighter  than  the  bounds  proved  in  Chapters  4  and  5  in 
some cases and (2) provide limits on the possibility of perfect strong scaling (see Table
6.1 for a summary), and

•  summarizing  further  extensions  of  the  lower  bound  results  and  proof  techniques. 

In Section 6.1 we define the “Strassen-like” class of algorithms and show how the analysis
of Chapter 5 can be used to obtain similar results for this class of algorithms. These results
extend beyond square matrix multiplication algorithms to other linear algebra computations
as well as rectangular matrix multiplication. In the case of parallel algorithms, the
lower bounds proved in the previous chapters are not always tight. We show in Section 6.2
that there are “memory-independent” lower bounds, proved using techniques similar to those for the
bounds that do depend on the local memory size, which hold independently for most linear
algebra computations. These new bounds, when applied in combination with the memory-dependent
ones, tighten the parallel lower bounds for a wider range of problem sizes (relative
to the number of processors and local memory sizes). In Section 6.3 we briefly discuss other
extensions to the lower bound analysis, but we leave the details to the references given.

Most  of  the  content  of  Section  6.1  appears  in  [26],  written  with  coauthors  James  Demmel, 
Olga  Holtz,  and  Oded  Schwartz.  The  exception  is  Section  6.1.3  which  appears  in  [29],  and 
Section  6.1.4  is  a  brief  summary  of  [21],  written  with  the  additional  coauthor  Benjamin 
Lipshitz.  The  results  of  Section  6.2  appear  in  [19]  (without  proof);  the  proofs  are  included 
in  the  technical  report  [22],  both  written  with  all  four  coauthors. 
6.1 Strassen-like Algorithms

We  can  extend  the  bounds  for  Strassen’s  matrix  multiplication  to  a  wider  class  of  algorithms, 
namely  Strassen-like  algorithms. 

Definition 6.1. A Strassen-like algorithm is a recursive algorithm for multiplying square¹
matrices which is constructed from a base case of multiplying $n_0 \times n_0$ matrices using $m_0 < n_0^3$
scalar multiplications, resulting in an algorithm for multiplying $n \times n$ matrices requiring
$O(n^{\omega_0})$ flops where $\omega_0 = \log_{n_0} m_0$. In order to be Strassen-like, the base case decoding graph
(referred to as $Dec_1C$ in Section 5.2), which gives the dependencies between the $m_0$ scalar
multiplication results and the $n_0^2$ entries of the output matrix, must be connected.

Given two matrices of size $n \times n$, a Strassen-like algorithm splits them into $n_0^2$ blocks (each
of size $(n/n_0)$-by-$(n/n_0)$), and works block-wise, according to the base algorithm. Additions
(and subtractions) in the base algorithm are interpreted as additions (and subtractions) of
blocks; multiplications in the base algorithm are interpreted as multiplications of blocks,
which are performed by recursively calling the algorithm. The arithmetic count of the
algorithm is then $T(n) = m_0 \cdot T(n/n_0) + O(n^2)$, so $T(n) = O(n^{\omega_0})$ where $\omega_0 = \log_{n_0} m_0$.
This is the structure of nearly all the fast matrix multiplication algorithms that were
obtained since Strassen’s (see [151] for a summary and the most recent results). In fact, any
$O(n^{\omega_0})$ matrix multiplication algorithm can be converted into a recursive matrix multiplication
algorithm of running time $O(n^{\omega_0 + \varepsilon})$ for any $\varepsilon > 0$ [124]. Furthermore, the algorithm
can be made numerically stable while preserving this form [59].
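The recurrence for the arithmetic count can be unrolled directly. The sketch below is illustrative: it assumes the $O(n^2)$ term consists of a fixed number of block additions per recursive step, each costing $(n/n_0)^2$ scalar additions, and that $T(1) = 1$; the parameter name `block_additions` is hypothetical.

```python
import math


def omega0(n0, m0):
    """Exponent omega_0 = log_{n0}(m0) of a Strassen-like algorithm."""
    return math.log(m0) / math.log(n0)


def arithmetic_count(n, n0, m0, block_additions):
    """Evaluate T(n) = m0 * T(n/n0) + block_additions * (n/n0)^2 with T(1) = 1.

    Assumes n is a power of n0.
    """
    if n == 1:
        return 1
    if n % n0 != 0:
        raise ValueError("n must be a power of n0")
    sub = n // n0
    return m0 * arithmetic_count(sub, n0, m0, block_additions) + block_additions * sub * sub
```

With $n_0 = 2$, $m_0 = 7$ and 18 block additions this gives the count for Strassen's original algorithm, which matches the closed form $7 n^{\lg 7} - 6 n^2$ for powers of two; with $m_0 = 8$ and 4 block additions it gives the classical count $2n^3 - n^2$.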

However,  to  be  considered  Strassen-like,  an  algorithm’s  computation  graph  must  also 
satisfy  a  technical  assumption  described  in  Section  6.1.1.  This  precludes  some  of  the  fast 
algorithms  (and  perhaps  others  whose  computation  graphs  have  not  been  explicitly  specified) 
as well as the classical $O(n^3)$ algorithm.

6.1.1  Connected  Decoding  Graph  Assumption 

For our technique to work, and in order to be considered Strassen-like, we demand that the
$Dec_1C$ part of the computation graph is a connected graph (this is assumed in the proof of
Claim 5.5 in Section 5.2.1.2). Thus the Strassen-like class includes Winograd’s variant of
Strassen’s algorithm [152], which uses 15 additions rather than 18. The Strassen-like class
does not contain the classical algorithm, where $Dec_1C$ is composed of four disconnected
graphs (corresponding to the four outputs). We believe this assumption is an artifact of our
proof technique and unnecessary for the same lower bounds to apply.
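The connectivity condition can be illustrated on the 2 × 2 base cases. The sketch below makes two assumptions that are not taken from this section: the output formulas are those of the standard textbook presentation of Strassen's algorithm, and the decoding graph is simplified to a bipartite dependency graph between products and outputs (intermediate addition vertices are collapsed, which does not change connectivity). All names are illustrative.

```python
def count_components(dependencies):
    """Count connected components of a bipartite graph given as
    {output: [products it depends on]}, using union-find."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for output, products in dependencies.items():
        parent.setdefault(("out", output), ("out", output))
        for product in products:
            parent.setdefault(("mul", product), ("mul", product))
            parent[find(("out", output))] = find(("mul", product))
    return len({find(x) for x in parent})


# Standard Strassen formulas (assumed): C11 = M1+M4-M5+M7, C12 = M3+M5,
# C21 = M2+M4, C22 = M1-M2+M3+M6.
STRASSEN = {"C11": [1, 4, 5, 7], "C12": [3, 5], "C21": [2, 4], "C22": [1, 2, 3, 6]}

# Classical 2x2 multiplication: each output sums two products of its own.
CLASSICAL = {"C11": [1, 2], "C12": [3, 4], "C21": [5, 6], "C22": [7, 8]}
```

Under these assumptions, `count_components(STRASSEN)` reports a single component, while `count_components(CLASSICAL)` reports four, one per output.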

6.1.2  Communication  Costs  of  Strassen-like  Algorithms 

¹ See Section 6.1.4 for extensions to fast rectangular matrix multiplication algorithms.

To prove Theorem 6.3, which generalizes the bandwidth cost lower bound of Strassen’s
algorithm (Theorem 5.10) to all Strassen-like algorithms, we note the following: the entire
proof of Theorem 5.10, and in particular, the computations in the proof of Lemma 5.9,
hold for any Strassen-like algorithm, where we plug in $n_0^2$ and $m_0$ instead of 4 and 7. For
bounding the asymptotic bandwidth cost, we do not care about the number of internal
vertices of $Dec_1C$; we need only to know that $Dec_1C$ is connected (this critical technical
assumption is used in the proof of Claim 5.5), and to know the sizes $n_0$ and $m_0$. The only
nontrivial adjustment is to show the equivalent of Claim 5.3, that the decoding graph is of
bounded degree. This is given in the following claim:

Claim  6.2.  The  decoding  graph  of  any  Strassen-like  algorithm  is  of  degree  bounded  by  a 
constant  which  is  independent  of  the  input  size. 

Proof. If the set of input vertices of $Dec_1C$ and the set of its output vertices are disjoint,
then the entire decoding graph is of constant bounded degree (its maximal degree is at most
twice the largest degree of $Dec_1C$).

Assume (towards contradiction) that the base graph $Dec_1C$ has an input vertex which
is also an output vertex. An output vertex represents the inner product of two $n_0$-long
vectors (i.e., the corresponding row vector of $A$ and column vector of $B$). The corresponding
bilinear polynomial is irreducible. This is a contradiction, since an input vertex represents
the multiplication of a (weighted) sum of elements of $A$ with a (weighted) sum of elements
of $B$. □


Thus,  we  state  (without  formal  proof)  the  extensions  of  Theorem  5.10  and  Corollary  5.11 
to  Strassen-like  algorithms: 


Theorem 6.3. Consider a recursive Strassen-like fast matrix multiplication algorithm with
$O(n^{\omega_0})$ arithmetic operations implemented on a sequential machine with fast memory of size
$M$. Then for $M \le n^2$, and assuming no recomputation, the bandwidth cost of the Strassen-like
algorithm is

$$W = \Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right).$$

Corollary 6.4. Consider a Strassen-like algorithm implemented on a parallel machine with
$P$ processors, each with a local memory of size $M$. There exists a constant $c$ such that for
$M \le c \cdot n^2/P^{2/\omega_0}$, and assuming no recomputation, the bandwidth cost of the Strassen-like
algorithm is

$$W = \Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot \frac{M}{P}\right).$$

6.1.3  Fast  Linear  Algebra 

The lower bounds for matrix multiplication extend straightforwardly to computations
that involve a matrix multiplication subcomputation. As shown in [58], many linear
algebra computations can be performed recursively, with the same computational complexity
as the matrix multiplication subroutine they call. As we will see in Section 7.2, these
algorithms also attain the same asymptotic communication costs as the matrix multiplication
algorithm they employ. The following lower bound applies to all the algorithms in [58]
assuming the fast matrix multiplication subroutine is Strassen-like.

Corollary 6.5. Suppose an algorithm has a CDAG containing as a subgraph the CDAG of
a Strassen-like matrix multiplication algorithm with input size $\Omega(n)$ which performs $\Omega(n^{\omega_0})$
flops. Then, assuming that no intermediate value is computed twice, the number of words
moved during the computation on a machine with fast memory of size $M$ is

$$W = \Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right).$$

Proof. The proof of Theorem 6.3 is based on an analysis of the CDAG of a Strassen-like
matrix multiplication algorithm. If the CDAG of a computation includes as a subgraph
a CDAG which corresponds to $\Omega(n) \times \Omega(n)$ Strassen-like matrix multiplication, then the
analysis yields the same communication lower bound for that subset of the computation and
therefore the entire computation. □

Note  that  there  may  be  many  different  CDAGs  which  correspond  to  computing,  for 
instance,  an  LU  decomposition  using  a  Strassen-like  matrix  multiplication  as  a  subroutine. 
For  example,  the  algorithms  of  [58]  split  the  matrix  into  equal-sized  left  and  right  halves, 
but  another  algorithm  may  split  the  matrix  into  a  tall-skinny  panel  and  a  larger  trailing 
matrix.  Corollary  6.5  applies  to  all  such  algorithms  that  contain  a  sufficiently  large  subgraph 
corresponding  to  a  Strassen-like  matrix  multiplication. 

This  result  implies  that  given  the  CDAG  that  a  recursive  algorithm  of  [58]  produces, 
no  re-ordering  of  the  computation  can  improve  the  communication  costs  by  more  than  a 
constant  factor  compared  to  the  depth-first  ordering  given  by  the  recursive  algorithm.  The 
result  does  not  apply  to  algorithms  which  restructure  the  CDAG  beyond  the  freedom  allowed 
by  commutativity  and  associativity  of  addition.  See  Section  7.2  for  more  details. 

6.1.4  Fast  Rectangular  Matrix  Multiplication  Algorithms 

Many fast algorithms have been devised for multiplication of rectangular matrices (see [21] for
a detailed list). A fast algorithm for multiplying $m_0 \times n_0$ and $n_0 \times p_0$ matrices in $q < m_0 n_0 p_0$
scalar multiplications can be applied recursively to multiply $m_0^t \times n_0^t$ and $n_0^t \times p_0^t$ matrices
in $O(q^t)$ flops. For such algorithms, the CDAG has very similar structure to Strassen and
Strassen-like algorithms for square multiplication in that it is composed of two encoding
graphs and one decoding graph. Assuming that the decoding graph is connected, the proofs
of Theorem 5.10 and Lemma 5.9 apply where we plug in $m_0 p_0$ and $q$ for 4 and 7. In this
case, we obtain a result analogous to Theorem 5.10 which states that the bandwidth cost of
such an algorithm is given by

$$W = \Omega\left(\frac{q^t}{M^{\log_{m_0 p_0} q - 1}}\right).$$

If the output matrix is the largest of the three matrices (i.e., $n_0 \le m_0$ and $n_0 \le p_0$), then
this lower bound is tight (e.g., it is attained by the natural recursive algorithm). The lower
bound extends to the parallel case as well, analogous to Corollary 5.11.
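As a consistency check, the rectangular bound should reduce to the bound for Strassen's algorithm when $m_0 = n_0 = p_0 = 2$ and $q = 7$. The sketch below evaluates both expressions with hidden constants set to 1 (an assumption made purely for comparison); the function names are illustrative.

```python
import math


def rectangular_bound(m0, n0, p0, q, t, M):
    """q^t / M^(log_{m0*p0}(q) - 1), with the hidden constant taken to be 1."""
    if q >= m0 * n0 * p0:
        raise ValueError("a fast base case needs q < m0 * n0 * p0")
    exponent = math.log(q) / math.log(m0 * p0) - 1
    return q ** t / M ** exponent


def strassen_bound(n, M):
    """(n / sqrt(M))^(lg 7) * M, with the hidden constant taken to be 1."""
    return (n / math.sqrt(M)) ** math.log2(7) * M
```

For square inputs of dimension $n = 2^t$, the two functions agree up to floating-point rounding, since $q^t = n^{\lg 7}$ and $M^{\log_4 7 - 1} = M^{(\lg 7)/2 - 1}$.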

However,  in  the  case  that  the  decoding  graph  is  not  connected,  the  proof  does  not  apply, 
and  in  the  case  the  output  matrix  is  not  the  largest,  the  lower  bound  is  not  necessarily 
tight.  In  order  to  handle  these  technical  challenges,  we  can  employ  modifications  of  the 
proof  technique:  we  can  deal  with  the  decoding  graph  being  disconnected  by  considering 
individual  connected  components,  or  we  can  consider  one  of  the  two  encoding  graphs,  which 
may  contain  vertices  of  high  degree.  In  either  case,  the  proofs  must  be  adapted,  and  we 
obtain  slightly  weaker  results.  We  detail  these  approaches  in  [21]  and  discuss  the  application 
to  rectangular  matrix  multiplication  algorithms  of  [37]  and  [89]. 


6.2  Memory-Independent  Lower  Bounds 

The lower bounds for classical linear algebra in Chapter 4 and Strassen’s matrix multiplication
in Chapter 5 each depend on the size of the fast or local memory, $M$. Since $M$ appears
in the denominator in both cases, these bounds suggest that as more local memory is used,
less communication is necessary. In the parallel case, especially for large $P$, there may be
much more local memory available than what is required to store the inputs and outputs of
the computation. In fact, algorithms discussed in Section 8.2 show that extra local memory
(i.e., $M \gg n^2/P$) can be used effectively to reduce communication.

However, there is a limit to the tradeoff between extra memory and reduced communication.
Using similar proof techniques to those used in Chapters 4 and 5, we can prove
memory-independent lower bounds (with more restrictive assumptions on the initial data layout
and computational load-balance) that begin to dominate the memory-dependent lower
bounds for certain problem sizes.

In  addition  to  tightening  existing  bounds,  the  lower  bounds  in  this  section  yield  another 
interesting  conclusion  regarding  strong  scaling.  We  say  that  an  algorithm  exhibits  perfect 
strong  scaling  if  it  attains  running  time  on  P  processors  which  is  linear  in  1/P,  including 
all communication costs, for some range of $P$. For example, Cannon’s parallel matrix multiplication
algorithm (see Section 8.1.1) has a parallel computational cost of $O(n^3/P)$ flops
but a bandwidth cost of $O(n^2/\sqrt{P})$ words. Thus, Cannon’s algorithm scales perfectly with
respect  to  the  computational  cost  but  not  with  respect  to  the  communication  cost.  While  it 
is  possible  for  classical  and  Strassen-based  matrix  multiplication  algorithms  to  strongly  scale 
perfectly,  the  communication  costs  restrict  the  strong  scaling  ranges  much  more  than  do  the 
computation  costs.  These  ranges  depend  on  the  problem  size  relative  to  the  local  memory 
size,  and  on  the  computational  complexity  of  the  algorithm. 
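The Cannon example can be made concrete by comparing how each cost shrinks as $P$ grows. The sketch below assumes per-processor costs of exactly $n^3/P$ flops and $n^2/\sqrt{P}$ words (hidden constants set to 1) and reports the achieved reduction relative to the ideal factor $P/P_0$; the function names are illustrative.

```python
import math


def cannon_costs(n, P):
    """Per-processor (flops, words) for Cannon's algorithm, constants omitted."""
    return n ** 3 / P, n ** 2 / math.sqrt(P)


def scaling_efficiency(n, procs):
    """For each P, return (P, flop efficiency, bandwidth efficiency),
    where efficiency 1.0 means the cost fell by the ideal factor P / procs[0]."""
    p0 = procs[0]
    flops0, words0 = cannon_costs(n, p0)
    rows = []
    for p in procs:
        flops, words = cannon_costs(n, p)
        ideal = p / p0
        rows.append((p, (flops0 / flops) / ideal, (words0 / words) / ideal))
    return rows
```

Under these assumptions the flop efficiency stays at 1 for every $P$, while the bandwidth efficiency decays as $\sqrt{P_0/P}$, which is the sense in which the computation scales perfectly and the communication does not.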

Interestingly, in both cases the dominance of a memory-independent bound arises, and
the strong scaling range ends, exactly when the memory-dependent latency lower bound
becomes $\Omega(1)$. Of course, since the latency cost cannot possibly drop below a constant, it is
an immediate result of the memory-dependent bounds that the latency cost cannot continue
to strongly scale perfectly. However, for sufficiently large problems, the bandwidth cost
typically dominates the cost, and the memory-independent bandwidth scaling bounds limit
the strong scaling of matrix multiplication in practice. For simplicity we omit discussions
of latency cost in this section, since the lower bound on the number of messages is always
a factor of $M$ below the bandwidth cost in the strong scaling range and is always constant
outside the strong scaling range.

While the main arguments in this section focus on matrix multiplication, they can be
generalized to other algorithms, including other three-nested-loops computations (see Definition 4.1)
and other Strassen-like algorithms (see Definition 6.1). See Section 6.2.3 for more
details.

6.2.1  Communication  Lower  Bounds 

6.2.1.1 Classical Matrix Multiplication

In this section, we prove a memory-independent lower bound for classical matrix multiplication
of $\Omega(n^2/P^{2/3})$ words. The same result appears elsewhere in the literature, under slightly
different assumptions: in the LPRAM model [4], where no data exists in the (unbounded)
local memories at the start of the algorithm; in the distributed-memory model [95], where
the local memory size is assumed to be $M = \Theta(n^2/P^{2/3})$; and in the distributed-memory
model [137], where the algorithm is assumed to perform a certain amount of input replication.
Our bound is for the distributed-memory model, holds for any $M$, and assumes no
specific communication pattern.

Using  Lemma  2.7  (in  a  similar  way  to  Chapter  4  and  [95]),  we  can  describe  the  ratio 
between  the  number  of  scalar  multiplications  a  processor  performs  and  the  amount  of  data 
it  must  access. 

Lemma 6.6. Suppose a processor has $I$ words of initial data at the start of an algorithm, performs
$\Theta(n^3/P)$ scalar multiplications within classical matrix multiplication, and then stores
$O$ words of output data at the end of the algorithm. Then the processor must send or receive
at least $\Omega(n^2/P^{2/3}) - I - O$ words during the execution of the algorithm.

Proof. We follow the proofs of Chapter 4 and [95]. Consider a discrete $n \times n \times n$ cube where
the lattice points correspond to the scalar multiplications within the matrix multiplication
$A \cdot B$ (i.e., lattice point $(i, j, k)$ corresponds to the scalar multiplication $a_{ik} \cdot b_{kj}$). Then the
three pairs of faces of the cube correspond to the two input and one output matrices.

The projections on the three faces correspond to the input/output elements the processor
has to access (and must communicate if they are not in its local memory). By Lemma 2.7, and
the fact that $\sqrt{|V_x| \cdot |V_y| \cdot |V_z|} \le \sqrt{\tfrac{1}{6}} \, (|V_x| + |V_y| + |V_z|)^{3/2}$, the number of words the processor
must access is at least $\sqrt[3]{6} \, |V|^{2/3} = \Omega(n^2/P^{2/3})$. Since the processor starts with $I$ words and
ends with $O$ words, the result follows. □
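The geometric step in this proof can be checked on small examples. The sketch below assumes that Lemma 2.7 is the Loomis-Whitney-type inequality $|V| \le \sqrt{|V_x| \cdot |V_y| \cdot |V_z|}$ for a set $V$ of lattice points and its three projections; that reading, and the function names, are assumptions made for illustration. Integer arithmetic is used so the comparisons are exact.

```python
import random


def data_accessed(points):
    """Sizes of the three projections of a set of lattice points (i, j, k),
    where (i, j, k) stands for the scalar multiplication a_ik * b_kj."""
    a_entries = {(i, k) for i, j, k in points}
    b_entries = {(k, j) for i, j, k in points}
    c_entries = {(i, j) for i, j, k in points}
    return len(a_entries), len(b_entries), len(c_entries)


def satisfies_bounds(points):
    """Check |V|^2 <= |Vx||Vy||Vz| and (|Vx|+|Vy|+|Vz|)^3 >= 6 |V|^2."""
    na, nb, nc = data_accessed(points)
    v = len(points)
    projection_bound = v * v <= na * nb * nc
    words_bound = (na + nb + nc) ** 3 >= 6 * v * v
    return projection_bound and words_bound


def random_points(n, count, seed=0):
    """A random subset of `count` lattice points from the n x n x n cube."""
    rng = random.Random(seed)
    cube = [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]
    return set(rng.sample(cube, count))
```

A full sub-cube makes the projection inequality tight, while scattered point sets need proportionally more data per multiplication, which is why the bound is governed by the most compact assignment of work.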
Theorem 6.7. Suppose a parallel algorithm performing classical dense matrix multiplication
begins with one copy of the input matrices and minimizes computational costs in an
asymptotic sense. Then, for sufficiently large $P$, some processor has a bandwidth cost of

$$W = \Omega\left(\frac{n^2}{P^{2/3}}\right).$$

Proof. At the end of the algorithm, every element of the output matrix must be fully computed
and exist in some processor’s local memory (though multiple copies of the element may
exist in multiple memories). For each output element, we designate one memory location as
the output and disregard all other copies. For each of the $n^2$ designated memory locations,
we consider the $n$ scalar multiplications whose results were used to compute its value and
disregard all other redundantly computed scalar multiplications.

In order to minimize computational costs asymptotically, the running time for classical
dense matrix multiplication must be $O(n^3/P)$. This is possible only if at least a constant
fraction of the processors perform $\Theta(n^3/P)$ of the scalar multiplications corresponding to
designated outputs.

Since there exists only one copy of the input matrices and designated output ($O(n^2)$ words
of data), some processor which performs $\Theta(n^3/P)$ multiplications must start and end with no
more than $I + O = O(n^2/P)$ words of data. Thus, by Lemma 6.6, some processor must read
or write $\Omega(n^2/P^{2/3}) - O(n^2/P) = \Omega(n^2/P^{2/3})$ words of data. □

Note  that  the  theorem  applies  to  any  P  >  2  with  a  strict  enough  assumption  on  the  load 
balance.  For  discussion  of  algorithms  attaining  this  bound,  see  Section  8.2.1. 

6.2.1.2 Strassen's Matrix Multiplication

In this section, we prove a memory-independent lower bound for Strassen's matrix
multiplication of $\Omega(n^2/P^{2/\lg 7})$ words. We reuse notation and proof techniques from Chapter 5. By
prohibiting redundant computations we mean that each arithmetic operation is computed by
exactly one processor. This is necessary for interpreting edge expansion as communication
cost.

Theorem 6.8. Suppose a parallel algorithm performing Strassen's matrix multiplication
minimizes computational costs in an asymptotic sense and performs no redundant computation.
Then, for sufficiently large P, some processor must have a bandwidth cost of

$$W = \Omega\left(\frac{n^2}{P^{2/\lg 7}}\right).$$

Proof. Recall that the computation DAG of Strassen's algorithm multiplying square matrices
$A \cdot B = C$ can be partitioned into three subgraphs: an encoding of the elements of A, an
encoding of the elements of B, and a decoding of the scalar multiplication results to compute
the elements of C. These three subgraphs are connected by edges that correspond to scalar
multiplications. Call the third subgraph $Dec_{\lg n}C$, where $\lg n$ is the number of levels of
recursion for matrices of dimension n.

In order to minimize computational costs asymptotically, the running time for Strassen's
matrix multiplication must be $O(n^{\lg 7}/P)$. Since a constant fraction of the flops correspond
to vertices in $Dec_{\lg n}C$, this is possible only if some processor performs $\Theta(n^{\lg 7}/P)$ flops
corresponding to vertices in $Dec_{\lg n}C$.

By Lemma 5.9, the edge expansion of $Dec_k C$ is given by $h(Dec_k C) = \Omega((4/7)^k)$. Using
Claim 5.2, we deduce that

$$h_s(Dec_{\lg n}C) = \Omega\left(\left(\frac{4}{7}\right)^{\log_7 s}\right), \qquad (6.1)$$

where $h_s$ is the edge expansion for sets of size at most $s$.

Let $S$ be the set of vertices of $Dec_{\lg n}C$ that correspond to computations performed by
the given processor. Set $s = |S| = \Theta(n^{\lg 7}/P)$. By Equation (6.1), the number of edges
between $S$ and $\bar{S}$ is

$$|E(S, \bar{S})| = \Omega\left(s \cdot h_s(Dec_{\lg n}C)\right) = \Omega\left(\frac{n^2}{P^{2/\lg 7}}\right),$$

and because $Dec_{\lg n}C$ is of bounded degree (Claim 5.3) and each vertex is computed by only
one processor, the number of words moved is $\Omega(|E(S, \bar{S})|)$ and the result follows. □
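The final equality in the edge count can be verified directly. As a sketch, assume the expansion bound of Equation (6.1) takes the form $h_s = \Omega((4/7)^{\log_7 s})$, which is what Lemma 5.9 gives for a decoding subgraph with $7^k \approx s$ vertices, and suppress constants:

```latex
\begin{align*}
s \cdot \left(\tfrac{4}{7}\right)^{\log_7 s}
  &= s \cdot s^{\log_7 (4/7)}
   = s^{\log_7 4}
   = s^{2/\lg 7}, \\
s = \Theta\!\left(\frac{n^{\lg 7}}{P}\right)
  &\;\Longrightarrow\;
  s^{2/\lg 7} = \Theta\!\left(\frac{n^2}{P^{2/\lg 7}}\right).
\end{align*}
```

The first line uses $a^{\log_b c} = c^{\log_b a}$ and $\log_7 4 = 2/\lg 7$; the second uses $(n^{\lg 7})^{2/\lg 7} = n^2$.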

Note that the theorem applies to any P > 2 with a strict enough assumption on the load
balance among vertices in $Dec_{\lg n}C$ as defined in the proof. For details of the algorithm that
attains this bound, see Chapter 11.


6.2.2  Limits  of  Strong  Scaling 

In  this  section  we  present  limits  of  strong  scaling  of  matrix  multiplication  algorithms.  These 
are immediate implications of the memory-independent communication lower bounds proved
in Section 6.2.1. Roughly speaking, the memory-dependent communication cost lower bound
is of the form $\Omega(f(n, M)/P)$ for both classical and Strassen matrix multiplication algorithms.
However, the memory-independent lower bounds are of the form $\Omega(f(n)/P^c)$ where $c < 1$ (see
Table 6.1). This implies that strong scaling is not possible when the memory-independent
bound dominates. We make this formal below.

Corollary 6.9. Suppose a parallel algorithm performing Strassen's matrix multiplication
minimizes bandwidth and computational costs in an asymptotic sense and performs no
redundant computation. Then the algorithm can achieve perfect strong scaling only for

$$P = O\left(\frac{n^{\lg 7}}{M^{(\lg 7)/2}}\right).$$
                                 Classical                                    Strassen

Memory-dependent lower bound     $\Omega\left(\frac{n^3}{P\sqrt{M}}\right)$         $\Omega\left(\frac{n^{\lg 7}}{P M^{(\lg 7)/2 - 1}}\right)$

Memory-independent lower bound   $\Omega\left(\frac{n^2}{P^{2/3}}\right)$           $\Omega\left(\frac{n^2}{P^{2/\lg 7}}\right)$

Perfect strong scaling range     $P = O\left(\frac{n^3}{M^{3/2}}\right)$            $P = O\left(\frac{n^{\lg 7}}{M^{(\lg 7)/2}}\right)$

Table 6.1: Bandwidth cost lower bounds for matrix multiplication and perfect strong scaling
ranges. The classical memory-dependent bound is due to [95], and the Strassen
memory-dependent bound is proved in Chapter 5. The memory-independent bounds are proved here,
though variants of the classical bound appear in [4, 95, 137].


Proof. By Corollary 5.11, any parallel algorithm performing matrix multiplication based on
Strassen moves at least $\Omega(n^{\lg 7}/(P M^{(\lg 7)/2 - 1}))$ words. By Theorem 6.8, a parallel algorithm
that minimizes computational costs and performs no redundant computation moves at least
$\Omega(n^2/P^{2/\lg 7})$ words. This latter bound dominates in the case $P = \Omega(n^{\lg 7}/M^{(\lg 7)/2})$. Thus,
while a communication-optimal algorithm will strongly scale perfectly up to this threshold,
after the threshold the communication cost will scale as $1/P^{2/\lg 7}$ rather than $1/P$. □

Corollary 6.10. Suppose a parallel algorithm performing classical dense matrix multiplication
starts and ends with one copy of the data and minimizes bandwidth and computational
costs in an asymptotic sense. Then the algorithm can achieve perfect strong scaling only for

$$P = O\left(\frac{n^3}{M^{3/2}}\right).$$

Proof. By [95], any parallel algorithm performing classical matrix multiplication moves at
least $\Omega(n^3/(P\sqrt{M}))$ words. By Theorem 6.7, a parallel algorithm that starts and ends with
one copy of the data and minimizes computational costs moves at least $\Omega(n^2/P^{2/3})$ words.
This latter bound dominates in the case $P = \Omega(n^3/M^{3/2})$. Thus, while a
communication-optimal algorithm will strongly scale perfectly up to this threshold, after the threshold the
communication cost will scale as $1/P^{2/3}$ rather than $1/P$. □
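The thresholds in Corollaries 6.9 and 6.10 are the values of P at which the memory-dependent and memory-independent bounds coincide. As a short check, with constants suppressed, equating the two bounds in each case gives:

```latex
\begin{align*}
\frac{n^3}{P\sqrt{M}} = \frac{n^2}{P^{2/3}}
  &\iff P^{1/3} = \frac{n}{\sqrt{M}}
   \iff P = \frac{n^3}{M^{3/2}}, \\
\frac{n^{\lg 7}}{P\,M^{(\lg 7)/2 - 1}} = \frac{n^2}{P^{2/\lg 7}}
  &\iff P^{1 - 2/\lg 7} = \frac{n^{\lg 7 - 2}}{M^{(\lg 7)/2 - 1}}
   \iff P = \frac{n^{\lg 7}}{M^{(\lg 7)/2}}.
\end{align*}
```

In both cases the common value of the two bounds at the threshold is $\Theta(M)$ words: at the edge of the perfect strong scaling range, each processor communicates about as much data as fits in its local memory.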

In  Figure  6.1  we  present  the  asymptotic  communication  costs  of  classical  and  Strassen- 
based  algorithms  for  a  fixed  problem  size  as  the  number  of  processors  increases.  Both  types 
of  perfectly  strong  scaling  algorithms  stop  scaling  perfectly  above  some  number  of  processors, 
which  depends  on  the  matrix  size  and  the  available  local  memory  size. 

Figure 6.1: Bandwidth cost lower bounds and strong scaling of matrix multiplication: classical
vs. Strassen. Horizontal lines correspond to perfect strong scaling. $P_{\min}$ is the minimum
number of processors required to store the input and output matrices.

Let $P_{\min} = \Theta(n^2/M)$ be the minimum number of processors required to store the input
and output matrices. By Corollaries 6.9 and 6.10 the perfect strong scaling range is
$P_{\min} \le P \le P_{\max}$ where $P_{\max} = \Theta(P_{\min}^{3/2})$ in the classical case and
$P_{\max} = \Theta(P_{\min}^{(\lg 7)/2})$ in the Strassen case. Note that the perfect strong scaling range
is larger for the classical case, though the communication costs are higher. Also note that
outside the perfect strong scaling range, the communication costs do not grow linearly as the
scale of the figure seems to suggest. In the classical case, (bandwidth cost) $\times P$ is proportional
to $P^{1/3}$; in the Strassen case, it is proportional to $P^{1 - 2/\lg 7} \approx P^{0.29}$ outside the strong
scaling range. Note that in both the classical and Strassen cases, there are algorithms attaining
these lower bounds.
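To make the scaling ranges concrete, the bounds of Table 6.1 can be evaluated numerically. The sketch below drops all constant factors, so its values indicate scaling behavior only; the function names and the sample values of n and M are our own illustrative choices.

```python
import math

LG7 = math.log2(7)  # exponent of Strassen's algorithm


def memory_dependent_bound(n, M, P, w):
    """n^w / (P * M^(w/2 - 1)), with w = 3 (classical) or lg 7 (Strassen)."""
    return n ** w / (P * M ** (w / 2 - 1))


def memory_independent_bound(n, P, w):
    """n^2 / P^(2/w), with w = 3 (classical) or lg 7 (Strassen)."""
    return n ** 2 / P ** (2 / w)


def bandwidth_lower_bound(n, M, P, w):
    """The larger of the two bounds is the one that applies."""
    return max(memory_dependent_bound(n, M, P, w),
               memory_independent_bound(n, P, w))


def perfect_scaling_limit(n, M, w):
    """P_max = n^w / M^(w/2), where the two bounds coincide."""
    return n ** w / M ** (w / 2)


if __name__ == "__main__":
    n, M = 4096, 2 ** 20
    print("P_min      ~", n ** 2 / M)
    print("classical  P_max ~", perfect_scaling_limit(n, M, 3))
    print("Strassen   P_max ~", perfect_scaling_limit(n, M, LG7))
    for P in (16, 64, 512):
        print(P, bandwidth_lower_bound(n, M, P, 3) * P)
```

For these sample values $P_{\min}$ is 16, and the perfect strong scaling range ends near 64 processors in the classical case and near 49 in the Strassen case. The product of bandwidth cost and P stays flat inside the range and grows beyond it.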

6.2.3  Extensions  of  Memory-Independent  Bounds 

The memory-dependent bound of classical matrix multiplication of [95] is generalized in
Chapter 4 to algorithms that satisfy Definition 4.1. The memory-independent bound of
classical matrix multiplication (Theorem 6.7) applies to these other algorithms as well. If
the algorithm begins with one copy of the input data and minimizes computational costs
in an asymptotic sense, then, for sufficiently large P, some processor must send or receive
at least $\Omega\left((G/P)^{2/3} - D/P\right)$ words, where $G$ is the total number of $g_{ijk}$ computations and $D$
is the number of non-zeros in the input and output. The proof follows that of Lemma 6.6
and Theorem 6.7, setting |E| = G (instead of $n^3$), replacing $n^3/P$ with $G/P$, and setting
$I + O = O(D/P)$ (instead of $O(n^2/P)$).

The memory-independent bound and perfect strong scaling bound of Strassen's matrix
multiplication (Theorem 6.8 and Corollary 6.9) apply to other Strassen-like algorithms, as
defined in Section 6.1, with $\omega_0$ (the exponent of the total arithmetic count) replacing $\lg 7$,
provided that the decoding graph is connected. The proof follows that of Theorem 6.8 and
of Corollary 6.9, but uses Claim 6.2 instead of Claim 5.3 and replaces Lemma 5.9 with its
extension.
6.3 Other Extensions

There  are  several  more  extensions  of  the  lower  bounds  presented  in  this  and  the  previous 
chapters  and  many  open  problems.  In  this  section,  we  point  out  a  few  of  these  directions 
and  the  corresponding  references. 

6.3.1 k-Nested-Loops Computations

The heart of the argument made in Chapter 4 is based on relating a three-dimensional set
of computation to two-dimensional data sets using the geometric inequality of Loomis and
Whitney [107] (see Lemma 2.7). Since most linear algebra computations are three nested
loops, this geometric relationship is sufficient for the algorithms considered here. Consider
instead performing an "N-body" calculation, where we wish to compute all the pairwise
interactions within a set of N particles. In this case, the data is one dimensional (a list
of particles) and the computation is two dimensional (all $N^2$ pairwise interactions). Thus,
Lemma 2.7, which is based on three-dimensional lattice points, no longer directly relates a
subset of the computation to the data involved. However, the more general version of the
result that appears in [107] can relate computation to communication for the N-body calculation:
it relates a d-dimensional volume to its projections onto (d − 1)-dimensional subspaces.
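For the N-body case the two-dimensional instance of the inequality is easy to state. As an illustrative sketch (our own specialization, assuming each processor performs $\Theta(N^2/P)$ of the interactions), let $V$ be the set of index pairs $(i, j)$ a processor computes and let $V_x$ and $V_y$ be its projections onto the two axes, that is, the particles it must access:

```latex
\[
|V| \le |V_x| \cdot |V_y|
\quad\Longrightarrow\quad
|V_x| + |V_y| \;\ge\; 2\sqrt{|V_x| \cdot |V_y|} \;\ge\; 2\sqrt{|V|}
= \Omega\!\left(\frac{N}{\sqrt{P}}\right)
\quad\text{when } |V| = \Theta\!\left(\frac{N^2}{P}\right).
\]
```

The middle step is the arithmetic-geometric mean inequality, playing the same role here as in the three-dimensional argument for matrix multiplication.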

If, on the other hand, a d-dimensional computation accesses some data of dimension d − 2,
some of d − 3, and some of d − 1, the Loomis-Whitney inequality is no longer helpful. A recent
generalization of the Loomis-Whitney inequality [34] can be used to prove communication
lower bounds for such computations (and many more). The statement of the more general
inequality and its implications on communication costs for a wide class of algorithms are
given in [51]. In the paper, the authors prove communication bounds for algorithms that
have arbitrary numbers of loops and access arrays with arbitrary dimensions, as long as the
index expressions are affine combinations of loop variables.

6.3.2  Sparse  Matrix-Matrix  Multiplication 

The lower bounds given in Chapter 4 hold for both dense and sparse computations. However,
particularly in the sparse case, the lower bounds may not be tight (i.e., attainable).
Although the focus of this dissertation is dense linear algebra, consider the multiplication of
two sparse matrices. Corollary 4.3 effectively upper bounds the number of possible useful
computations for every word of data read into fast/local memory at $O(\sqrt{M})$. However, for
matrices which are very sparse, elements of an input matrix may be involved in far fewer
than $O(\sqrt{M})$ computations, making that amount of data reuse unattainable. Even the
memory-independent bounds described in Section 6.2.3 can be unattainable.

Using assumptions on the type of sparsity in the input matrices and properties particular
to the computation, we can prove a tighter (and attainable) lower bound for sparse
matrix-matrix multiplication [16]. The lower bound proof resembles the memory-independent bound
proof of Section 6.2.1.1 but also depends on the randomness of the input matrices (and proves
results that hold only in expectation).




Part  II 

Algorithms and Communication Cost Analysis




Chapter  7 

Sequential Algorithms and their Communication Costs


The lower bound results of Chapters 3-6 provide targets for algorithmic development; in this
chapter we focus on algorithms for sequential machines and discuss the current state of the
art in terms of communication costs. The main contribution of this chapter is to provide
a comprehensive (though certainly not exhaustive) summary of communication-optimal
sequential algorithms for dense linear algebra computations and to give references to papers
that provide algorithmic details, communication cost analyses, and demonstrated performance
improvements. We consider all of the fundamental computations (BLAS, Cholesky
and symmetric-indefinite factorizations, LU and QR decompositions, eigenvalue and singular
value decompositions) and compare the best sequential algorithms with the lower bounds
established in Chapter 4. Chapter 8 summarizes parallel algorithms, and Chapters 9-11
focus on algorithms for particular computations.

Recall  the  sequential  two-level  memory  model  presented  in  Section  2.2.1.  We  consider 
communication  between  a  fast  memory  of  size  M  and  a  slow  memory  of  unbounded  size, 
and  we  track  both  the  number  of  words  and  messages  that  an  algorithm  moves.  Because 
a  message  requires  its  words  to  be  stored  contiguously  in  slow  memory,  we  must  specify 
the  matrix  data  layout  in  determining  latency  costs  (see  Section  2.3.1  for  details  on  data 
layouts).  We  also  consider  the  multiple-level  memory  hierarchy  model  in  this  chapter,  as  it 
more  accurately  reflects  today’s  machines. 

One may imagine that sequential algorithms that minimize communication for any number
of levels of a memory hierarchy (see Section 2.2.1.2) might be very complex, possibly
depending not just on the number of levels, but also their sizes. In this context, it is worth
distinguishing a class of algorithms, called cache oblivious [70], that can minimize
communication between all levels (at least asymptotically) independent of the number of levels and
their sizes. These algorithms are recursive, and provided a matching recursive layout is used,
these algorithms may also minimize the number of messages independent of the number of
levels of memory hierarchy. Not only do cache-oblivious algorithms perform well in theory,
but they can also be adapted to perform well in practice (see [156], for example).
The rest of this chapter is divided into two sections, classical and fast linear algebra
computations.  In  Section  7.1,  we  present  Table  7.1  with  references  to  the  communication- 
optimal  algorithms  for  the  most  fundamental  dense  linear  algebra  computations  and  then 
discuss  each  computation  and  the  corresponding  state-of-the-art  algorithms  in  turn.  In 
Section  7.2,  we  highlight  extensions  of  fast  matrix  multiplication  algorithms  to  other  dense 
linear  algebra  computations. 

The main contributions to the set of communication-optimal sequential algorithms,
appearing either in this thesis or in manuscripts written with various sets of coauthors, are as
follows:

•  communication cost analysis for Cholesky decomposition algorithms [24],

•  first sequential communication-optimal algorithm for symmetric-indefinite factorization
[14] (see Chapter 9),

•  first sequential cache-oblivious algorithm for LU decomposition to minimize latency
costs (similar algorithm for QR decomposition is also communication optimal) [32],

•  new sequential communication-avoiding algorithms for the symmetric eigendecomposition
and SVD [30] (see Chapter 10),

•  first sequential communication-optimal algorithm for the nonsymmetric eigendecomposition
[17], and

•  communication cost analysis for fast algorithms for linear algebra computations [29].

Table 7.1 is an updated version of Table 6.1 in [28], and some of the discussion here is
based on that paper, written with coauthors James Demmel, Olga Holtz, and Oded Schwartz.
The material on Cholesky decomposition in Section 7.1.2 is based on [24], written with the
same set of coauthors, and the material on LU decomposition in Section 7.1.4 also appears in
[32], written with coauthors James Demmel, Benjamin Lipshitz, Oded Schwartz, and Sivan
Toledo.


7.1  Classical  Linear  Algebra 

In this section, we discuss the current state of the art for algorithms that perform the
classical $O(n^3)$ computations on sequential machines. For each of the computations considered
here, we can compare the communication costs of the algorithms to the lower bounds
presented in Chapter 4. Table 7.1 summarizes the communication-optimal classical algorithms
for the most fundamental dense linear algebra computations. We differentiate between
algorithms that minimize communication only in the two-level model and those that are optimal
also in a multiple-level memory hierarchy. We also differentiate between algorithms that
minimize only bandwidth costs and those that minimize both bandwidth and latency costs.
Computation

Two  Levels  of  Memory 

Multiple  Levels  of  Memory 

Minimizes 

Words 

Minimizes 

Messages 

Minimizes 

Words 

Minimizes 

Messages 

BLAS-3 

[32,  70] 

[32,  70] 

Cholesky

[7,  8,  24,  83] 

[7,  24,  83] 

[7,  24,  83] 

Symmetric 

Indefinite 

[14] 

[14] 

[14] 

[14] 

LU 

[32,  80,  83,  142] 

[32,  80] 

[32,  83,  142] 

[32] 

QR 

[32,  62,  66,  69] 

[32,  62,  69] 

[32,  66,  69] 

[32,  69] 

Sym Eig and SVD

[17,  30] 

[17] 

Nonsym Eig

[17] 

[17] 

Table 7.1: Sequential classical algorithms attaining communication lower bounds. We
separately list algorithms that attain the lower bounds for two levels and multiple levels of
memory hierarchy. In each of these cases, we separately list algorithms that minimize only
the number of words moved and algorithms that also minimize the number of messages.


In order for an algorithm to be considered communication-optimal in the sequential
model, we require that its communication costs be within a constant factor of the
corresponding lower bound and that it performs no more than a constant factor more
computation than alternative algorithms. Most of the algorithms have the same leading constant
in computational cost as the standard algorithms, though we note where constant factor
increases exist. In some cases, there exists a small range of matrix dimensions where the
algorithm is suboptimal; the communication cost includes a term that exceeds the lower
bound and is not always lower order in this range. For example, the rectangular recursive
algorithms of [66, 83, 142] are suboptimal with respect to bandwidth cost when n satisfies
$n/\log n \ll \sqrt{M} \ll n$ [32]. We omit these details here for sufficiently small ranges.

We  emphasize  that  only  a  few  of  the  communication-optimal  algorithms  referenced  here 
are  included  in  standard  libraries  like  LAPACK.  While  this  chapter  focuses  on  asymptotic 
complexity  rather  than  measured  performance  on  current  architectures,  many  of  the  papers 
referenced  for  algorithms  here  also  include  performance  data  and  demonstrate  significant 
speedups  over  asymptotically  suboptimal  alternatives.  Our  communal  goal  is  to  eventually 
make  all  of  these  algorithms  available  via  widely  used  libraries.  There  is  a  very  large  body 
of  work  on  many  of  these  algorithms,  and  we  do  not  pretend  to  have  a  complete  list  of 
citations.  Instead  we  refer  just  to  papers  where  these  algorithms  first  appeared  (to  the 
best  of  our  knowledge),  with  or  without  analysis  of  their  communication  costs,  or  to  survey 
papers. 
7.1.1 BLAS Computations

While the lower bounds given in Section 4.1.2.1 apply to all BLAS computations, only the
BLAS-3 computations have algorithms that attain them. In the case of BLAS-2 and BLAS-1
computations, the arithmetic intensity (i.e., ratio of computation to data) is $O(1)$, so it is
impossible for the bandwidth cost to be a factor of $O(\sqrt{M})$ smaller than the arithmetic cost,
assuming the data has to be read from slow memory. In other words, there exist tighter
lower bounds for BLAS-2 and BLAS-1 computations based on the size of the inputs and
outputs, so the lower bounds of Section 4.1.2.1 are valid but unattainable.
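A rough operation count makes the contrast concrete. The sketch below assumes square n × n operands and counts each matrix or vector entry as one word that must be touched at least once; the function names are ours and the counts ignore lower-order details.

```python
def gemv_intensity(n):
    """Flops per word for y = A x (a BLAS-2 computation)."""
    flops = 2 * n * n        # n^2 multiplications and about n^2 additions
    words = n * n + 2 * n    # the matrix, the input vector, the output vector
    return flops / words


def gemm_intensity(n):
    """Flops per word for C = A B (a BLAS-3 computation)."""
    flops = 2 * n ** 3       # n^3 multiplications and about n^3 additions
    words = 3 * n * n        # two input matrices and one output matrix
    return flops / words


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(gemv_intensity(n), 3), round(gemm_intensity(n), 1))
```

The matrix-vector ratio stays below 2 for every n, so reading the inputs already dominates, while the matrix-matrix ratio grows linearly in n and leaves room for the data reuse that blocking exploits.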

For BLAS-3 computations, blocked versions of the naive algorithms attain the lower
bound in the two-level memory model when the block size is chosen to be $\Theta(\sqrt{M})$
(see for example the BLOCK-MULT algorithm in [70] for matrix multiplication). In order
to attain the corresponding latency cost lower bound, a block-contiguous data structure is
necessary so that every block computation involves contiguous chunks of memory.
Furthermore, because of the self-similarity of matrices and these fundamental computations, the
block computations can themselves be blocked. Using a nested level of blocking for each
level of memory (and choosing the block sizes appropriately), these algorithms can
minimize communication between every pair of successive levels in a memory hierarchy. Note
that a matching hierarchical block-contiguous data structure is needed to minimize latency
costs. We do not include a reference in Table 7.1, as these blocked algorithms are generally
considered folklore.
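The blocked approach and its bandwidth cost can be sketched in a few lines. The following is an illustrative implementation, not the BLOCK-MULT algorithm of [70] itself: it assumes n is a multiple of the block size b, that three b × b blocks fit in fast memory at once (3b² ≤ M), and it counts words moved in the two-level model under those assumptions.

```python
def blocked_matmul(A, B, b):
    """Multiply n x n matrices (lists of lists) using b x b blocks.

    Returns (C, words_moved), where words_moved counts words transferred
    between slow and fast memory, assuming one block each of A, B and C
    is resident in fast memory at a time.
    """
    n = len(A)
    assert n % b == 0, "this sketch assumes b divides n"
    C = [[0.0] * n for _ in range(n)]
    words_moved = 0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            # The C block stays in fast memory for the whole k0 loop.
            for k0 in range(0, n, b):
                words_moved += 2 * b * b  # read one block of A and one of B
                for i in range(i0, i0 + b):
                    for k in range(k0, k0 + b):
                        a_ik = A[i][k]
                        for j in range(j0, j0 + b):
                            C[i][j] += a_ik * B[k][j]
            words_moved += b * b  # write the finished C block back
    return C, words_moved
```

Under these assumptions the count is 2n³/b + n² words, so choosing b proportional to √M gives O(n³/√M) words moved, matching the lower bound up to a constant factor.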

In addition to the explicitly blocked algorithms, there are recursive algorithms for all of
the BLAS-3 computations. As explained in [70] for rectangular matrix multiplication, these
recursive algorithms also attain the lower bounds of Section 4.1.2.1. In order to minimize
latency costs, we use a matching recursive data layout, like the rectangular recursive layout
of [32] which matches the REC-MULT algorithm of [70]. For computations involving square
matrices, data layouts based on the Morton ordering and its variants help minimize latency
costs. For recursive algorithms for triangular solve, see for example [24, Algorithm 3], where
the right hand sides form a square matrix, or [32, Algorithm 5] for the general rectangular
case. Similar algorithms exist for symmetric and triangular matrix multiplications and
symmetric rank-k updates. Because these algorithms are recursive and cache oblivious, they
minimize communication costs between every pair of memory levels in a hierarchy.
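As an illustration of a recursive layout for square matrices, the Morton (Z-order) position of entry (i, j) is obtained by interleaving the bits of the two indices. The sketch below assumes a matrix of dimension 2^k with k at most 16; the function names and the choice of placing row bits in the odd positions are ours.

```python
def morton_index(i, j, bits=16):
    """Interleave the bits of row i and column j into a Z-order index."""
    index = 0
    for t in range(bits):
        index |= ((i >> t) & 1) << (2 * t + 1)  # row bits go to odd positions
        index |= ((j >> t) & 1) << (2 * t)      # column bits go to even positions
    return index


def morton_layout(matrix):
    """Return the entries of a 2^k x 2^k matrix stored in Morton order."""
    n = len(matrix)
    flat = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[morton_index(i, j)] = matrix[i][j]
    return flat
```

With this ordering every aligned 2^t × 2^t submatrix occupies a contiguous range of the flat array, so each recursive subproblem reads and writes contiguous memory at every level of the recursion.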

7.1.2  Cholesky  Decomposition 

The  Cholesky  decomposition  is  used  primarily  for  solving  symmetric,  positive-definite  linear 
systems  of  equations.  Because  the  computation  inherits  numerical  stability  properties  from 
the  matrices  to  which  it  is  applied,  it  enjoys  a  freedom  in  algorithmic  design  (no  pivoting 
is  required),  and  communication-optimal  algorithms  are  well  known.  For  a  more  complete 
discussion  of  sequential  algorithms  for  Cholesky  decomposition  and  their  communication 
properties, see [24].
The reference implementation in LAPACK [8] (potrf) is a blocked algorithm, and by
choosing the block size to be $\Theta(\sqrt{M})$, the algorithm attains the lower bound of Corollary
4.6. As in the case of the BLAS computations, a block-contiguous data structure is used
to obtain the latency cost lower bound. An algorithm with nested levels of blocking and a
matching data layout can minimize communication for multiple levels of memory.

A recursive algorithm for Cholesky decomposition was first proposed in [83] and later
matched with a block-recursive data structure in [7]. We present the communication cost
analysis in [24], where the algorithm is shown to be communication optimal and cache
oblivious as long as cache-oblivious subroutines are used. Thus, the recursive algorithm is
optimal for both two-level and multiple-level memory models.
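The recursive structure can be sketched as follows. This is a minimal illustration written for clarity, not the algorithm of [83] or [7] as published: it uses plain Python lists, and its triangular solve and symmetric update are simple loops rather than the cache-oblivious subroutines the analysis requires. The helper name `solve_lower_transposed` is ours.

```python
import math


def solve_lower_transposed(B, L):
    """Solve X * L^T = B for X, where L is lower triangular."""
    h = len(L)
    X = []
    for b_row in B:
        x = [0.0] * h
        for j in range(h):
            s = sum(x[k] * L[j][k] for k in range(j))
            x[j] = (b_row[j] - s) / L[j][j]
        X.append(x)
    return X


def cholesky_recursive(A):
    """Return lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    if n == 1:
        return [[math.sqrt(A[0][0])]]
    h = n // 2
    m = n - h
    A11 = [row[:h] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]
    A22 = [row[h:] for row in A[h:]]

    L11 = cholesky_recursive(A11)            # factor the leading block
    L21 = solve_lower_transposed(A21, L11)   # L21 = A21 * L11^{-T}
    # Schur complement: A22 - L21 * L21^T
    S = [[A22[i][j] - sum(L21[i][k] * L21[j][k] for k in range(h))
          for j in range(m)] for i in range(m)]
    L22 = cholesky_recursive(S)              # factor the trailing block

    top = [L11[i] + [0.0] * m for i in range(h)]
    bottom = [L21[i] + L22[i] for i in range(m)]
    return top + bottom
```

Replacing the two loop-based subroutines with recursive triangular solve and recursive matrix multiplication is what makes the whole algorithm cache oblivious.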

Note also that for the Cholesky factorization of sparse matrices whose sparsity structure
satisfies certain graph-theoretic conditions (having "good separators"), the lower bound of
Corollary 4.6 can also be attained [79]. For general sparse matrices, the problem is open.

7.1.3  Symmetric-Indefinite  Decompositions 

If the linear system is symmetric but not positive definite, then pivoting is required to ensure
numerical stability. The need for pivoting complicates the computation, and
communication-efficient algorithms are not as straightforward. A brief history of symmetric-indefinite
factorizations and their communication costs is given at the beginning of Chapter 9. The most
commonly used factorization (and the one implemented in LAPACK) is $LDL^T$, where D is
block diagonal with 2 × 2 and 1 × 1 blocks. An alternative factorization is due to Aasen [1]
and computes a tridiagonal matrix T instead of a block diagonal matrix D. Both
factorizations use symmetric pivoting. The same lower bound for dense matrices, given in Corollaries
4.13 and 4.14, applies to both computations.

However, no current communication-optimal algorithms exist that compute these
factorizations directly. The implementations of $LDL^T$ in LAPACK [8] (sytrf) and of $LTL^T$ in
[125] can attain the lower bound for large matrices (where n > M) but fail for reasonably
sized matrices (where $\sqrt{M}$ < n < M). They also never attain the latency cost lower bound,
and work only for the two-level memory model. It is an open problem whether there exists
a communication-optimal algorithm that computes the factorizations directly.

The communication-optimal algorithm presented in Chapter 9 and [14] first computes
a factorization $LTL^T$ where T is a symmetric band matrix (with bandwidth $\Theta(\sqrt{M})$) and
then decomposes T in a second step. This algorithm is a block algorithm, and with a
block-contiguous data structure, it minimizes both words and messages in the two-level memory
model.

Because the subroutines in the communication-avoiding symmetric-indefinite factorization
can all be performed with blocked or recursive algorithms themselves, it is possible to
extend the algorithm (with matching data structure) to minimize communication costs in the
multiple-level memory model. Note that our Shape-Morphing LU algorithm [32] is necessary
to perform the panel factorization subroutine with optimal latency cost for all subsequent
levels of memory.




7.1.4  LU  Decomposition 

For general, nonsymmetric linear systems, an LU decomposition is the direct method of
choice. As in the case of symmetric-indefinite systems, pivoting is required to maintain
numerical stability. For performance reasons (and because it is generally sufficient in
practice), we consider here performing only row interchanges. We leave consideration of
communication-optimal complete-pivoting algorithms (those that perform both row and column
interchanges) to future work (see [61, Section 5] for a possible approach). The content of
this section also appears in [32, Section 6].

There  is  a  long  history  of  algorithmic  innovation  to  reduce  communication  costs  for 
LU  factorizations.  Table  1  in  [32]  highlights  several  of  the  innovations  and  compares  the 
asymptotic  communication  costs  of  the  algorithms  discussed  here. 

The LU decomposition algorithm in LAPACK (getrf) is based on "blocking" in order to
cast much of the work in terms of matrix-matrix multiplication rather than working
column-by-column and performing most of the work as matrix-vector operations. The algorithm is a
right-looking, blocked algorithm, and by choosing the right block size, the algorithm
asymptotically reduces the communication costs compared to the column-by-column algorithm. In
fact, for very large matrices (m, n > M) it can attain the communication lower bound (see
Corollary 4.5). However, for reasonably sized matrices (m, n < M) the blocked algorithm is
sub-optimal with respect to its communication costs.

In the late 1990s, both Toledo [142] and Gustavson [83] independently showed that using
recursive algorithms can reduce communication costs. The analysis in [142] shows that
the recursive LU (RLU) algorithm moves asymptotically fewer words than the algorithm
in LAPACK when m < M (though latency cost is not considered in that work). In fact,
the RLU algorithm attains the bandwidth cost lower bounds. Furthermore, RLU is cache
oblivious, so it minimizes bandwidth cost for any fast memory size and between any pair of
successive levels of a memory hierarchy.
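The recursion behind RLU can be sketched compactly: split the columns of the panel in half, factor the left half recursively, update the right half, and then factor the right half recursively. The code below is our own illustrative sketch of that idea, not the implementation analyzed in [142] or [83]; it assumes a nonsingular m × n input with m ≥ n, stores the matrix as plain Python lists, and applies each row interchange to whole rows as soon as the pivot is chosen.

```python
def recursive_lu(A):
    """Factor a copy of A (m x n, m >= n) as P A = L U with partial pivoting.

    Returns (LU, perm): LU holds the unit-lower-triangular L strictly below
    the diagonal and U on and above it; perm[r] is the index in A of row r
    of P A.
    """
    LU = [[float(x) for x in row] for row in A]
    perm = list(range(len(LU)))
    _rlu(LU, 0, len(LU[0]), perm)
    return LU, perm


def _rlu(A, c0, ncols, perm):
    """Factor columns c0 .. c0+ncols-1 of rows c0 .. m-1 in place."""
    m = len(A)
    if ncols == 1:
        # Base case: a single column, ordinary partial pivoting.
        p = max(range(c0, m), key=lambda r: abs(A[r][c0]))
        A[c0], A[p] = A[p], A[c0]            # swap whole rows
        perm[c0], perm[p] = perm[p], perm[c0]
        for r in range(c0 + 1, m):
            A[r][c0] /= A[c0][c0]            # assumes a nonzero pivot
        return
    h = ncols // 2
    c1, c2 = c0 + h, c0 + ncols
    _rlu(A, c0, h, perm)                     # factor the left half
    for c in range(c1, c2):
        # U12 = L11^{-1} A12 (unit lower triangular solve, top rows first).
        for r in range(c0 + 1, c1):
            A[r][c] -= sum(A[r][k] * A[k][c] for k in range(c0, r))
        # Schur complement update: A22 -= L21 U12.
        for r in range(c1, m):
            A[r][c] -= sum(A[r][k] * A[k][c] for k in range(c0, c1))
    _rlu(A, c1, ncols - h, perm)             # factor the right half
```

The pivot choices are those of ordinary partial pivoting; only the order in which the updates are applied differs from the column-by-column algorithm. In a cache-oblivious version the triangular solve and the Schur complement update would themselves be recursive.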

Motivated by the growing latency cost on both sequential and parallel machines, Grigori,
Demmel, and Xiang [80] considered bandwidth and latency cost metrics and presented
an algorithm called Communication-Avoiding LU (CALU) that minimizes both. In order
to attain the lower bound for latency cost (proved in that paper via reduction from
matrix multiplication; see Section 3.2.1), the authors used the block-contiguous layout and
introduced tournament pivoting as a new and different scheme than partial pivoting. The
tournament pivoting scheme makes different pivoting choices than partial pivoting and is
theoretically less stable (though the two schemes are equivalent in a weak sense and have similar
characteristics in practice [80]). The drawbacks to CALU are that it requires knowledge of
the fast memory size for both algorithm and data layout (i.e., it is not cache oblivious), and
that, because of its youth, tournament pivoting does not enjoy the same confidence from the
numerical community as partial pivoting (see [144], for example).

Making the RLU algorithm latency optimal had been an open problem for a few years.
For example, arguments are made in [24] and [80] that RLU is not latency optimal for several
different fixed data layouts. In [32], using a technique called “shape-morphing,” we show
that attaining communication optimality, being cache oblivious, and using partial pivoting
are all simultaneously achievable.

7.1.5  QR  Decomposition 

The QR decomposition is commonly used for solving least squares problems, but because of
its  numerical  stability,  it  has  applications  in  many  other  computations  such  as  eigenvalue 
and  singular  value  decompositions  and  many  iterative  methods.  While  there  are  several 
approaches  to  computing  a  QR  factorization,  including  Gram-Schmidt  orthogonalization  and 
Cholesky-QR,  we  focus  in  this  section  on  those  approaches  that  use  a  sequence  of  orthogonal 
transformations  (i.e.,  Householder  transformations  or  Givens  rotations)  because  they  are  the 
most  numerically  stable. 

The  history  of  reducing  communication  costs  for  QR  decomposition  is  very  similar  to 
that  of  LU.  The  algorithm  in  LAPACK  (geqrf)  is  based  on  the  Householder  QR  algorithm 
(see  [57,  Algorithm  3.2]  for  example),  computes  one  Householder  vector  per  column,  and  also 
uses  blocking  to  cast  most  of  the  computation  in  terms  of  matrix  multiplication.  However, 
this  form  of  blocking  is  more  complicated  than  in  the  case  of  LU  and  is  based  on  the  ideas 
of  [42,  131].  While  the  blocking  requires  extra  flops,  the  cost  is  only  a  lower  order  term. 
Though the algorithm in LAPACK is much more communication-efficient than the
column-by-column approach, it still does not minimize bandwidth cost for reasonably sized matrices
($\sqrt{M} < n < M$), and the column-major data layout prevents latency optimality. Note that
the  representation  of  the  Q  factor  is  compactly  stored  as  the  set  of  Householder  vectors  used 
to  triangularize  the  matrix  (i.e.,  one  Householder  vector  per  column  of  the  matrix). 

Shortly after the rectangular-recursive algorithms for LU were developed, a similar
algorithm for QR was devised in [66]. As in the case of LU, this algorithm is cache oblivious and
minimizes words moved (but not necessarily messages). It also computes one Householder
vector per column. However, the algorithm performs a constant factor more flops than
Householder  QR,  requiring  about  17%  more  arithmetic  for  tall-skinny  matrices  and  about 
30%  more  for  square  matrices.  To  address  this  issue,  the  authors  present  a  hybrid  algorithm 
which  combines  the  ideas  of  the  algorithm  in  LAPACK  and  the  rectangular-recursive  one. 
The  hybrid  algorithm  involves  a  parameter  that  must  be  chosen  correctly  (relative  to  the 
fast  memory  size)  in  order  to  minimize  communication,  so  it  is  no  longer  cache  oblivious. 

Later, another recursive algorithm for QR was developed in [69]. The recursive structure
of the algorithm involves splitting the matrix into quadrants instead of left and right halves,
more similar to the recursive Cholesky algorithm than the previous rectangular-recursive
LU and QR algorithms. Because recursive calls always involve matrix quadrants, the
algorithm maps perfectly to the block-recursive data layout. Indeed, with this data layout, the
algorithm minimizes both words and messages and is cache oblivious. Unfortunately, the
algorithm involves forming explicit orthogonal matrices rather than working with their
compact representations, which ultimately results in a constant factor increase in the number of
flops of about 3x. It is an open question whether this algorithm can be modified to reduce
this computational overhead.

At nearly the same time as the development of the CALU algorithm and tournament
pivoting, a similar blocked approach for QR decomposition, called communication-avoiding
QR (CAQR), was designed in [62]. CAQR maps to the block-contiguous data layout and
minimizes both words and messages in the two-level model, but because it requires the
algorithmic and data layout block size to be chosen correctly, it is not cache oblivious.
Interestingly, it also requires a new representation of the Q factor. While just as compact as in
the conventional Householder QR, the new representation varies with an internal
characteristic of the algorithm (i.e., the shape of the reduction trees). The ideas behind CAQR first
appear in [75]; other contributions include [48, 82, 66], and [62] gives a more complete list of
references. Note that the CAQR algorithm satisfies the assumptions of Theorem 4.25 (it
maintains forward progress and need not compute T matrices of dimension 2 or greater) and
attains both the bandwidth cost lower bound stated in the theorem as well as the latency
lower bound corollary.

As  explained  in  [32],  our  shape-morphing  technique  can  be  applied  to  the  rectangular- 
recursive  QR  algorithm  of  [66]  to  obtain  similar  results  as  in  the  case  of  LU  decomposition. 
Shape-Morphing  QR  is  both  communication  optimal  and  cache  oblivious,  though  it  suffers 
from  the  same  increase  in  computational  cost  as  the  original  rectangular-recursive  algorithm. 
Again,  a  hybrid  version  reduces  the  flops  at  the  expense  of  losing  cache  obliviousness. 

We  also  note  that  rank-revealing  QR  is  an  important  variant  of  QR  decomposition.  While 
the  conventional  QR  with  column  pivoting  approach  suffers  from  high  communication  costs, 
there do exist communication-optimal algorithms for this computation. See [61] for an
application of the tournament pivoting idea of CALU to column pivoting within rank-revealing
QR,  and  see  [17,  Algorithm  1]  for  a  randomized  rank-revealing  QR  algorithm  that  requires 
efficient  (non-rank-revealing)  QR  decomposition  and  matrix  multiplication  subroutines. 

7.1.6  Symmetric  Eigendecomposition  and  SVD 

The processes for determining the eigenvalues and eigenvectors of a symmetric matrix and
the singular values and singular vectors of a nonsymmetric matrix are computationally
similar. In both cases, the standard approach is to reduce the matrix via two-sided orthogonal
transformations (stably preserving the eigenvalues or singular values) to a condensed form.
In the symmetric case, this condensed form is a tridiagonal matrix; in the nonsymmetric
case, the matrix is reduced to bidiagonal form. Computing the eigenvalues or singular
values of these more structured matrices is much cheaper (both in terms of computation and
communication) than reducing the full matrices to condensed form, so we do not consider
this phase of computation here. The most commonly used tridiagonal and bidiagonal solvers
include MRRR, Bisection/Inverse Iteration, Divide-and-Conquer, and QR iteration (see [63],
for example). After both eigen- or singular values and vectors are computed for the
condensed forms, the eigen- or singular vectors of the original matrix can be computed via a
back-transformation, by applying the orthogonal matrices that transformed the dense matrix
to tri- or bidiagonal. See Section 10.1.1 for more details in the symmetric case.

LAPACK’s routines for computing the symmetric eigendecomposition (syev) and SVD
(gesvd) use a similar approach to the LU and QR routines, blocking the computations to
cast work in terms of calls to matrix multiplication. However, because the transformations
are two-sided, a constant fraction of the work is cast as BLAS-2 operations, like
matrix-vector multiply, which are communication inefficient. As a result, these algorithms do not
minimize bandwidth or latency costs, for any matrix dimension; they require communicating
$\Theta(n^3)$ words, meaning the data re-use achieved for a constant fraction of the work is as low
as $O(1)$.

In  2000,  Bischof,  Lang,  and  Sun  [38,  43]  proposed  a  two-step  approach  to  reducing  a 
symmetric  matrix  to  tridiagonal  form  known  as  Successive  Band  Reduction  (SBR):  first 
reduce  the  dense  matrix  to  band  form  and  then  reduce  the  band  matrix  to  tridiagonal.  The 
advantage  of  this  approach  is  that  the  first  step  can  be  performed  so  that  nearly  all  of  the 
computation is cast as matrix multiplication; that is, the data re-use can be $\Theta(\sqrt{M})$, which
is communication optimal. The drawback is that reducing the band matrix to tridiagonal
form is a difficult task, requiring $O(n^2 b)$ flops (as opposed to $O(nb^2)$ flops in the case of the
symmetric-indefinite linear solver) and complicated data dependencies. Chapter 10 and [30]
address  performing  this  reduction  in  a  communication-efficient  manner. 

If  only  eigenvalues  are  desired,  then  the  two-step  approach  applied  to  a  dense  symmetric 
matrix  performs  asymptotically  the  same  number  of  flops  as  the  standard  approach  used 
in  LAPACK.  If  eigenvectors  are  also  desired,  then  the  computational  cost  of  the  back- 
transformation  phase  is  higher  for  the  two-step  approach  by  a  constant  factor.  In  terms 
of  the  communication  costs,  the  two-step  approach  can  be  much  more  efficient,  matching 
the  lower  bound  of  Corollary  4.26  and  Theorem  4.27  (see  [17,  Section  6]  for  the  bandwidth 
cost  analysis).  In  order  to  minimize  communication  across  the  entire  computation,  the  two- 
step  approach  also  requires  a  communication-efficient  tall-skinny  QR  decomposition  (during 
the  first  step)  and  one  of  the  algorithms  proposed  in  Chapter  10  (for  the  second  step).  All  of 
the  SBR  algorithms  designed  for  the  symmetric  eigenproblem  can  be  adapted  for  computing 
the  SVD  in  the  nonsymmetric  case. 

Another communication-optimal approach is to use the spectral divide-and-conquer
algorithms described in Section 7.1.7, adapted for symmetric matrices or computing the SVD. In
the symmetric case, a more efficient iterative scheme is presented in [116]. These approaches
require efficient QR decomposition and matrix multiplication algorithms and perform a
constant factor more computation than the reduction approaches.

7.1.7  Nonsymmetric  Eigendecomposition 

The  standard  approach  for  computing  the  eigendecomposition  of  a  nonsymmetric  matrix 
is  similar  to  the  symmetric  case:  orthogonal  similarity  transformations  are  used  to  reduce 
the  dense  matrix  to  a  condensed  form,  from  which  the  Schur  form  is  computed.  In  the 
nonsymmetric  case,  the  condensed  form  is  an  upper  Hessenberg  matrix  (an  upper  triangular 
matrix  with  one  nonzero  subdiagonal),  and  QR  iteration  (with  some  variations)  is  typically 
used to annihilate the subdiagonal. In this case, the amount of data in the condensed
form is asymptotically the same as the original matrix (about $n^2/2$ versus $n^2$), and $\Theta(n^3)$
computation is required to obtain Schur form from a Hessenberg matrix (as opposed to
the symmetric case, where the data and computation involved in solving the tridiagonal
eigenproblem are lower order terms). Thus, in determining the communication cost of the
overall algorithm, we cannot ignore the second phase of computation as in Section 7.1.6.

LAPACK’s routine for the nonsymmetric eigenproblem (geev) takes the reduction
approach and moves $O(n^3)$ words in the reduction phase and $O(n^3)$ words in the QR iteration,
so it is suboptimal both in terms of words and messages. While there have been approaches
to reduce communication costs in the reduction phase in a manner similar to SBR,
reducing first to band-Hessenberg and then to Hessenberg form (see [96]), it is an open question
whether an asymptotic reduction is possible and the lower bound of Corollary 4.26 and
Theorem 4.27 is attainable. Even if the reduction phase can be done optimally, it is also an
open question whether QR iteration can be done in an equally efficient manner. Some work
on reducing communication for multi-shift QR iteration in the sequential model appears in
[113].

Because  of  the  difficulties  of  the  reduction  approach,  we  consider  a  different  approach  for 
computing  the  nonsymmetric  eigendecomposition,  called  spectral  divide-and-conquer.  In  this 
approach,  the  goal  is  to  compute  an  orthogonal  similarity  transformation  which  transforms 
the  original  matrix  into  a  block  upper  triangular  matrix,  thereby  generating  two  smaller 
subproblems  whose  Schur  form  can  be  combined  to  compute  the  Schur  form  of  the  original 
matrix.  While  there  are  a  variety  of  spectral  divide-and-conquer  methods,  we  focus  on  the 
one  proposed  in  [13],  adapted  in  [58],  and  further  developed  in  [17].  This  approach  relies 
on  a  randomized  rank-revealing  QR  factorization  and  communication-optimal  algorithms 
for  QR  decomposition  and  matrix  multiplication.  Under  mild  assumptions,  the  algorithm 
asymptotically  minimizes  communication  (and  is  cache  oblivious  if  the  QR  decomposition 
and  matrix  multiplication  algorithms  are)  but  involves  a  constant  factor  more  computation 
than the reduction and QR iteration approach. The algorithm can be applied to the
generalized eigenproblem as well as symmetric variants and the SVD. See [17] for full details of
the  algorithm  and  its  communication  costs. 

Note  also  that  to  form  the  eigendecomposition  from  Schur  form  requires  computing  the 
eigenvectors  of  a  (quasi-)triangular  matrix.  The  LAPACK  routine  for  this  computation 
(trevc)  computes  one  eigenvector  at  a  time  and  does  not  minimize  communication.  A 
communication-optimal,  blocked  algorithm  is  presented  in  [17,  Section  5]. 

7.2  Fast  Linear  Algebra 

A  depth-first  traversal  of  the  recursion  tree  given  by  Strassen’s  original  algorithm  minimizes 
bandwidth  cost  in  the  sequential  model.  See  [26,  Section  1.4.1]  for  the  cost  recurrence  and 
solution.  When  the  corresponding  recursive  data  layout  is  used,  the  algorithm  also  minimizes 
latency cost.

Demmel, Dumitriu, and Holtz [58] showed that nearly all of the fundamental algorithms
in  dense  linear  algebra  can  be  executed  with  asymptotically  the  same  number  of  flops  as 
matrix  multiplication.  Although  the  stability  properties  of  fast  matrix  multiplication  are 
slightly  weaker  than  those  of  classical  matrix  multiplication,  the  authors  show  in  [59]  that 
all  fast  matrix  multiplication  is  stable.  Further,  in  [58]  they  show  that  fast  linear  algebra 
can  be  made  stable  at  the  expense  of  only  a  polylogarithmic  (i.e.,  polynomial  in  log  n)  factor 
increase  in  cost.  That  is,  to  maintain  stability,  one  can  use  polylogarithmically  more  bits 
to  represent  each  floating  point  number  and  to  compute  each  flop.  While  this  increases  the 
time  to  perform  one  flop  or  move  one  word,  it  does  not  change  the  number  of  flops  computed 
or  words  moved  by  the  algorithm. 

The bandwidth cost analysis for the algorithms presented in [58] is given in [29]. While
stability and computational complexity were the main concerns in [58], in [29] the bandwidth
cost of the linear algebra algorithms is shown to match the lower bound of Corollary 6.5.
To minimize latency costs, the analysis in [29] must be combined with the shape-morphing
technique of [32].
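
The depth-first traversal of Strassen's recursion tree mentioned at the start of this section can be sketched as follows. This is only an illustration under simplifying assumptions: the matrix dimension is a power of two, the recursion runs down to 1 x 1 blocks with no cutoff to a classical kernel, and plain nested lists stand in for the recursive data layout. The helper names (add, sub, classical) are hypothetical.

```python
def add(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def sub(X, Y):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def classical(A, B):
    """Reference O(n^3) product, used only for checking."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def strassen(A, B):
    """Multiply two n x n matrices, n a power of two, with 7 recursive products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2

    def quad(X, i, j):
        # Extract quadrant (i, j) of X.
        return [row[j * h:(j + 1) * h] for row in X[i * h:(i + 1) * h]]

    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, 1), quad(A, 1, 0), quad(A, 1, 1)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, 1), quad(B, 1, 0), quad(B, 1, 1)
    # Depth-first order: each product is computed completely before the
    # next one starts, so only one subproblem is active per recursion level.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

A practical implementation would switch to a classical kernel once the subproblem fits in fast memory; the sketch omits that cutoff to keep the recursion visible.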

7.3  Conclusions  and  Future  Work 

While there exist communication-optimal algorithms for all of the computations discussed
in this chapter and presented in Table 7.1, much work remains to be done. In some cases,
implementations of theoretically optimal algorithms have demonstrated performance
improvements over previous algorithms; in others, implementations are still in progress. We
highlight in this section possible future directions of algorithmic improvement: finding ways
to reduce costs by constant factors or develop other theoretically optimal algorithms that
might perform more efficiently in practice.

For  example,  it  may  be  possible  to  minimize  both  words  and  messages  for  one-sided 
factorizations  without  relying  on  tournament  pivoting,  Householder  reduction  trees,  or  the 
block-Aasen  algorithm  by  using  more  standard  blocked  algorithms  (as  in  LAPACK)  and 
adding a second level of blocking. This could produce a communication-optimal $LDL^T$
algorithm.  Other  open  problems  with  respect  to  QR  decomposition  include  reconstructing 
Householder  vectors  from  the  tree  representation  of  TSQR  and  modifying  the  algorithm  of 
Frens  and  Wise  to  be  more  computationally  efficient. 

Because  there  are  many  variants  of  eigenvalue  and  singular  value  problems,  several  large- 
constant  improvements  are  possible,  particularly  in  the  cases  of  computing  eigenvectors  of 
a  symmetric  matrix  and  singular  vectors  of  a  nonsymmetric  matrix.  The  communication- 
optimal  nonsymmetric  eigensolver  also  suffers  from  high  computational  costs  and  requires 
more  optimization  to  be  competitive  in  practice. 

Finally,  the  fast  linear  algebra  algorithms  are  asymptotically  as  efficient  as  fast  matrix 
multiplication,  but  they  have  never  been  demonstrated  to  perform  better  than  classical 
algorithms in practice. One challenge to minimizing the constant factors is to optimize the

Chapter 8

Parallel  Algorithms  and  their 
Communication  Costs 


While Chapter 7 summarizes algorithms for the sequential case, we focus in this chapter
on parallel algorithms. The main contribution of the chapter is a comprehensive (though
certainly not exhaustive) summary of communication-optimal parallel algorithms for dense
linear algebra computations. Again, we provide references to the papers providing
algorithmic details, communication cost analyses, and demonstrated performance improvements.
We consider the same set of fundamental computations (BLAS, Cholesky and
symmetric-indefinite factorizations, LU and QR decompositions, eigenvalue and singular value
decompositions) and compare the best parallel algorithms with the lower bounds established in
Chapter 4. The following chapters, Chapters 9-11, will focus on algorithms for particular
computations.

Recall  the  distributed-memory  parallel  memory  model  presented  in  Section  2.2.2.  We 
consider  communication  among  a  set  of  P  processors,  all  connected  by  a  fully  connected 
network,  and  we  track  both  words  and  messages  communicated  by  the  parallel  machine 
along  the  critical  path(s)  of  the  algorithm.  All  of  the  algorithms  discussed  in  this  chapter 
assume a block-cyclic data distribution (see Section 2.3.2), where a block size of 1 gives a
cyclic distribution and a block size of $n/\sqrt{P}$ gives a blocked distribution.
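
To make the two extreme block sizes concrete, the sketch below maps a matrix entry to its owning processor under the usual two-dimensional block-cyclic convention. The function names and the zero-based indexing are illustrative assumptions; Section 2.3.2 may fix details (such as offsets) differently.

```python
def owner(i, j, b, pr, pc):
    """Grid coordinates of the processor that owns entry (i, j).

    b is the block size and the processors form a pr x pc grid; all
    indices are zero-based.
    """
    return ((i // b) % pr, (j // b) % pc)


def layout(n, b, pr, pc):
    """Owner of every entry of an n x n matrix, as a nested list."""
    return [[owner(i, j, b, pr, pc) for j in range(n)] for i in range(n)]
```

For n = 4 on a 2 x 2 grid, a block size of 1 deals rows and columns out like cards (cyclic), while a block size of 2 gives each processor one contiguous 2 x 2 block (blocked); in both cases every processor owns $n^2/P$ entries.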

The  rest  of  this  chapter  is  divided  into  three  sections.  The  first  two  sections  focus  on 
classical  algorithms:  those  that  use  no  more  than  a  constant  factor  more  than  the  minimum 
amount  of  local  memory  required  (Section  8.1)  and  those  that  use  asymptotically  more 
memory  to  reduce  communication  costs  (Section  8.2).  Section  8.3  considers  fast  linear  algebra 
computations. 

The main contributions to the set of communication-optimal parallel algorithms,
appearing either in this thesis or in manuscripts written with various sets of coauthors, are as
follows:

•  communication cost analysis for minimal-memory parallel Cholesky decomposition
algorithms [24],

•  first minimal-memory communication-avoiding parallel algorithms for the symmetric
eigendecomposition and SVD [30] (see Chapter 10),

•  first  minimal-memory  communication-optimal  parallel  algorithm  for  the  nonsymmetric 
eigendecomposition  [17],  and 

•  first  communication-optimal  parallel  algorithm  for  parallelizing  Strassen’s  and  other 
fast  matrix  multiplication  algorithms  [20,  105]  (see  Chapter  11). 

Table  8.1  is  an  updated  version  of  Table  6.2  in  [28],  and  some  of  the  discussion  here  is 
based  on  that  paper,  written  with  coauthors  James  Demmel,  Olga  Holtz,  and  Oded  Schwartz. 
The  material  on  Cholesky  decomposition  in  Section  7.1.2  is  based  on  [24],  written  with  the 
same set of coauthors. Section 8.3 includes a summary of Chapter 11 and [20],
written with James Demmel, Benjamin Lipshitz, Olga Holtz, and Oded Schwartz.

8.1  Classical  Linear  Algebra  (with  Minimal  Memory) 

In this section, we discuss the current state-of-the-art for algorithms that perform the
classical $O(n^3)$ computations for dense matrices on parallel machines, assuming no more than
a constant factor of extra local memory is used. For each of the computations considered
here, we can compare the communication complexity of the algorithms to the lower bounds
presented in Chapter 4, where we fix the local memory size to $M = \Theta(n^2/P)$. Table 8.1
summarizes  the  communication-optimal  algorithms  in  this  case  for  the  most  fundamental 
dense  linear  algebra  computations.  Another  term  for  these  minimal  memory  algorithms  is 
“2D,”  which  was  first  used  to  distinguish  minimal  memory  matrix  multiplication  algorithms 
from  so-called  “3D”  algorithms  that  do  use  more  than  a  constant  factor  of  extra  memory. 
In  these  2D  algorithms,  the  processors  are  organized  in  a  two-dimensional  grid,  with  most 
communication  occurring  within  processor  rows  or  columns. 

Recall the lower bounds that apply to these dense computations, where the number
of $g_{ijk}$ operations is $G = \Theta(n^3/P)$. For these values of $G$ and $M$, the lower bound on the
number of words communicated by any processor is $\Omega(n^2/\sqrt{P})$, and the lower bound on the
number of messages is $\Omega(\sqrt{P})$. In order for an algorithm to be considered
communication-optimal, we require that its communication complexity be within a polylogarithmic (in $P$)
factor of the two lower bounds and that it performs no more than a constant factor more
computation than alternative algorithms (i.e., the parallel computational cost is $\Theta(n^3/P)$).
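
The two bounds follow by direct substitution. The derivation below assumes that the Chapter 4 bounds take the familiar form $W = \Omega(G/\sqrt{M})$ for words and $S = \Omega(G/M^{3/2})$ for messages; the precise statements and constants are those of Chapter 4 and are not reproduced here.

```latex
\begin{align*}
W &= \Omega\!\left(\frac{G}{\sqrt{M}}\right)
   = \Omega\!\left(\frac{n^3/P}{\sqrt{n^2/P}}\right)
   = \Omega\!\left(\frac{n^2}{\sqrt{P}}\right), \\
S &= \Omega\!\left(\frac{G}{M^{3/2}}\right)
   = \Omega\!\left(\frac{n^3/P}{(n^2/P)^{3/2}}\right)
   = \Omega\!\left(\sqrt{P}\right).
\end{align*}
```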

The asymptotic communication costs for ScaLAPACK algorithms are given in [44, Table
5.8]. Note that the bandwidth costs for those algorithms include a log P factor due to the
assumption that collective communication operations (e.g., reductions and broadcasts) are
performed with tree-based algorithms. Better algorithms exist for these collectives that do
not incur the extra factor in bandwidth cost. For example, a broadcast can be performed with
a scatter followed by an all-gather (see [50] for more details). Thus, the extra logarithmic
factor is not inherent in the algorithm, only in the way collective operations were first
implemented in ScaLAPACK.

Computation             Minimizes Words    Minimizes Messages
BLAS-3                  [3, 44, 49, 71]    [3, 44, 49, 71]
Cholesky                [44]               [44]
Symmetric Indefinite    [14, 44]           [14]
LU                      [44, 80]           [80]
QR                      [44, 62]           [62]
Sym Eig and SVD         [17, 30, 44]       [17, 30]
Nonsym Eig              [17]               [17]

Table 8.1: Parallel classical algorithms attaining communication lower bounds assuming
minimal memory is used. That is, these algorithms have computational cost $\Theta(n^3/P)$ and
use local memory of size $O(n^2/P)$. We separately list algorithms that minimize only the
number of words moved and algorithms that also minimize the number of messages.
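
The effect of the broadcast implementation on bandwidth cost can be seen with a small cost model. The functions below are a sketch under stated assumptions, not formulas taken from [44] or [50]: the tree broadcast is assumed to forward the full message in each of about log2(P) rounds, the scatter is assumed to use a binomial tree, the all-gather is assumed to use recursive doubling, and P is assumed to be a power of two. The function names are hypothetical.

```python
import math


def tree_broadcast_cost(n_words, P):
    """Critical-path (words, messages) when the whole message is forwarded
    in every round of a binomial-tree broadcast."""
    rounds = math.ceil(math.log2(P))
    return n_words * rounds, rounds


def scatter_allgather_cost(n_words, P):
    """Critical-path (words, messages) for a scatter followed by an
    all-gather, with P a power of two."""
    rounds = int(math.log2(P))
    piece = n_words / P
    # Scatter: the root sends n/2, n/4, ... words; summed smallest first.
    scatter = sum(piece * 2 ** k for k in range(rounds))
    # All-gather by recursive doubling: n/P, 2n/P, ... words per round.
    allgather = sum(piece * 2 ** k for k in range(rounds))
    return scatter + allgather, 2 * rounds
```

Under these assumptions the scatter/all-gather variant moves 2n(P-1)/P words, which is bounded by 2n regardless of P, at the price of roughly twice as many messages.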

8.1.1  BLAS  Computations 

As in the sequential case, the lower bounds of Section 4.1.2.1 apply to all three levels of
BLAS  computations,  but  the  only  parallel  algorithms  that  can  attain  the  bounds  are  BLAS-3 
routines.  ScaLAPACK  [44]  uses  the  Parallel  BLAS  library,  or  PBLAS,  which  has  algorithms 
for  matrix  multiplication  (and  its  variants)  and  triangular  solve  that  minimize  both  words 
and  messages.  The  history  of  communication-optimal  matrix  multiplication  goes  back  to 
Cannon  [49].  While  Cannon’s  algorithm  is  asymptotically  optimal,  a  more  robust  and  tunable 
algorithm  known  as  SUMMA  [3,  71]  is  more  commonly  used  in  practice.  For  a  more  complete 
summary of minimal memory, or 2D, matrix multiplication algorithms, see [95, Section 4].
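
The communication pattern of a SUMMA-style algorithm can be sketched with a single-process simulation. The code below is only an illustration under simplifying assumptions: the processor grid is q x q with q dividing n, the panel width equals the block size n/q, and communication is modeled by counting the words each processor would receive rather than by sending anything. It is not PBLAS code, and the function name is hypothetical.

```python
def summa(A, B, q):
    """Simulate SUMMA for n x n matrices on a q x q processor grid.

    Returns the product C and, for each processor, the number of words it
    receives from other processors.
    """
    n = len(A)
    b = n // q                      # each processor owns one b x b block
    C = [[0] * n for _ in range(n)]
    words = [[0] * q for _ in range(q)]
    for k in range(q):              # one outer-product step per block column
        for pi in range(q):
            for pj in range(q):
                # Block A(pi, k) is broadcast along processor row pi and
                # block B(k, pj) along processor column pj; the owner of a
                # block does not need to receive it.
                if pj != k:
                    words[pi][pj] += b * b
                if pi != k:
                    words[pi][pj] += b * b
                # Local update C(pi, pj) += A(pi, k) * B(k, pj).
                for i in range(pi * b, (pi + 1) * b):
                    for j in range(pj * b, (pj + 1) * b):
                        for t in range(k * b, (k + 1) * b):
                            C[i][j] += A[i][t] * B[t][j]
    return C, words
```

In this model every processor receives 2(q-1)b^2 words, which is about $2n^2/\sqrt{P}$ for $P = q^2$ and so matches the bandwidth lower bound of Section 8.1 up to a constant factor.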

8.1.2  Cholesky  Decomposition 

ScaLAPACK’s parallel Cholesky routine (pxposv) minimizes both words and messages on
distributed-memory machines. See [24] for a description of the algorithm and its
communication analysis. Note that the bandwidth cost in that paper includes a log P factor that can
be removed with a more efficient broadcast routine.

8.1.3  Symmetric-Indefinite  Decompositions 

ScaLAPACK’s parallel $LDL^T$ routine (pxsysv) minimizes words moved, but it is not latency
optimal. For the sequential case, Chapter 9 and [14] present an alternate
communication-optimal symmetric-indefinite factorization based on a blocked version of Aasen’s $LTL^T$
factorization [1] (where $T$ is tridiagonal). While neither Chapter 9 nor [14] addresses the parallel
case, the algorithm can be parallelized to minimize communication in the minimal memory
case. This algorithm computes the factorization in two steps, first reducing the
symmetric matrix to band form and then factoring the band matrix. The first step requires the
use of a communication-efficient tall-skinny LU decomposition routine (as used in CALU in
Section 8.1.4). The second step can be performed naively with a nonsymmetric band LU
factorization (with no parallelism) with computational and communication costs that do not
asymptotically exceed the costs of the first step. We leave the details of the parallelization of
the reduction to band form and a more complete consideration of efficient parallel methods
for factoring the band matrix to future work.

8.1.4  LU  Decomposition 

As in the symmetric-indefinite case, ScaLAPACK’s LU routine (pxgesv) minimizes the
number of words moved but not the number of messages. Grigori, Demmel, and Xiang [80]
propose a communication-optimal algorithm known as Communication-Avoiding LU (CALU)
which uses “tournament pivoting,” a scheme that makes different pivoting decisions from
partial pivoting. The theoretical numerical stability guarantees are slightly weaker for CALU
than for algorithms using partial pivoting, though in practice both approaches show
similar behavior (see [80] for more details). The use of tournament pivoting allows the overall
algorithm to reach both bandwidth and latency cost lower bounds.
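
The idea of tournament pivoting can be sketched for a single tall-skinny panel with b columns. Each block of rows nominates b candidate pivot rows by running Gaussian elimination with partial pivoting locally, and candidates are then merged pairwise up a binary tree, always working with the original rows. This is a simplified, single-process reading of the scheme in [80], not its implementation; the function names and the way the panel is cut into blocks are assumptions made for illustration.

```python
def gepp_pivot_rows(rows, b):
    """Positions (within rows) of the b rows that Gaussian elimination
    with partial pivoting would choose as pivots, in pivot order."""
    W = [list(r) for r in rows]
    idx = list(range(len(W)))
    for j in range(min(b, len(W))):
        p = max(range(j, len(W)), key=lambda i: abs(W[i][j]))
        W[j], W[p] = W[p], W[j]
        idx[j], idx[p] = idx[p], idx[j]
        if W[j][j] != 0:
            for i in range(j + 1, len(W)):
                f = W[i][j] / W[j][j]
                for c in range(j, len(W[i])):
                    W[i][c] -= f * W[j][c]
    return idx[:b]


def tournament_pivots(panel, b, nblocks):
    """Global row indices of the b pivot rows selected by a binary-tree
    tournament over nblocks contiguous blocks of rows."""
    m = len(panel)
    size = -(-m // nblocks)                 # ceiling division
    # Leaves: every block nominates its own b candidate rows.
    cands = []
    for start in range(0, m, size):
        block = list(range(start, min(start + size, m)))
        local = gepp_pivot_rows([panel[g] for g in block], b)
        cands.append([block[t] for t in local])
    # Internal nodes: merge two candidate sets and keep the best b rows.
    while len(cands) > 1:
        merged_level = []
        for t in range(0, len(cands) - 1, 2):
            merged = cands[t] + cands[t + 1]
            local = gepp_pivot_rows([panel[g] for g in merged], b)
            merged_level.append([merged[s] for s in local])
        if len(cands) % 2:
            merged_level.append(cands[-1])
        cands = merged_level
    return cands[0]
```

With a single column (b = 1) the tournament returns the entry of largest magnitude, exactly as partial pivoting would; for wider panels the selected rows can differ from the partial-pivoting choice, which is the source of the weaker stability guarantee noted above. In a parallel setting each merge corresponds to one message, so the whole panel needs on the order of log P messages rather than one per column.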

In the sequential case, partial pivoting can be maintained while still minimizing both
words and messages using a technique known as shape-morphing [32]. Unfortunately, the
idea of shape-morphing is unlikely to yield the same benefits in the parallel case. Choosing
pivots for each of n columns lies on the critical path of the algorithm and therefore must be
done in sequence. Each pivot choice either requires at least one message or for the whole
column to reside on a single processor. This seems to require either $\Omega(n)$ messages or $\Omega(n^2)$
words moved, which both asymptotically exceed the respective lower bounds.

8.1.5  QR  Decomposition 

ScaLAPACK’s QR routine (pxgeqrf) also minimizes bandwidth cost but not latency cost. It
is a parallelization of the LAPACK algorithm, using one Householder vector per column. At
nearly the same time as the development of the CALU algorithm, Demmel, Grigori,
Hoemmen, and Langou [62] developed the Communication-Avoiding QR (CAQR) algorithm that
minimizes both words and messages. Note that the CAQR algorithm satisfies the
assumptions of Theorem 4.25 (it maintains forward progress and need not compute T matrices
of dimension 2 or greater) and attains both the bandwidth cost lower bound stated in the
theorem as well as the latency lower bound corollary. The principal innovation of parallel
CAQR is the factorization of a tall-skinny submatrix using only one reduction, for a cost
of $O(\log P)$ messages rather than communicating once per column of the submatrix. The
algorithm for tall-skinny matrices is called TSQR and is an important subroutine not only
for general QR decomposition but also many other computations (see [114], for example). In
order to obtain this reduction, the authors abandoned the conventional scheme of computing
one Householder vector per column and instead use a tree of Householder transformations.
This results in a different representation of the orthogonal factor, though it has the same
storage and computational requirements as the conventional scheme. It is possible to
perform TSQR and recover the conventional storage scheme without asymptotically increasing
communication costs, but we leave details of the approach to future work.

The TSQR operation is effectively a reduction operation, and in the distributed-memory
parallel case, the optimal reduction tree is binary. Applied to an $m \times n$ matrix such that
$m/P \ge n$, the communication costs of the reduction are $O(n^2 \log P)$ words and $O(\log P)$
messages. As mentioned in the beginning of Section 8.1, it is possible to remove the logarithmic
factor in the bandwidth cost for some simple reductions; it is an open question whether
this is possible with TSQR.
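
The reduction structure of TSQR can be illustrated with a minimal single-process sketch. It is not the implementation of [62]: the function name `tsqr_r` is hypothetical, the $P$ "processors" are simulated by row blocks of one NumPy array, and only the triangular factor is returned (the orthogonal factor, which the tree represents implicitly, is discarded). Each pass of the loop corresponds to one level of a binary tree, in which a pair of $n \times n$ triangular factors is stacked and refactored.

```python
import numpy as np

def tsqr_r(A, P):
    """Return (R, levels): the R factor of a tall-skinny A computed with a
    binary reduction tree over P row blocks, and the number of tree levels."""
    m, n = A.shape
    assert m // P >= n, "sketch assumes every row block has at least n rows"
    # Leaves: each simulated processor factors its own row block locally.
    Rs = [np.linalg.qr(block, mode="r") for block in np.array_split(A, P, axis=0)]
    levels = 0
    # Reduction: stack pairs of n x n triangular factors and refactor them.
    while len(Rs) > 1:
        merged = [np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode="r")
                  for i in range(0, len(Rs) - 1, 2)]
        if len(Rs) % 2 == 1:          # an unpaired factor moves up unchanged
            merged.append(Rs[-1])
        Rs = merged
        levels += 1
    return Rs[0], levels
```

In a distributed setting each level would cost one message of about $n^2/2$ words per participating processor, which is where the $\log P$ factors in both costs come from.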

As in the sequential case, for communication-optimal algorithms performing rank-revealing
QR decompositions, see [61] and [17, Algorithm 1].

8.1.6  Symmetric  Eigendecomposition  and  SVD 

As  in  the  case  of  one-sided  factorizations,  ScaLAPACK’s  routines  for  the  two-sided  fac¬ 
torizations  for  the  symmetric  eigendecomposition  (pxsyev)  and  SVD  (pxgesvd)  minimize 
bandwidth  cost  but  fail  to  attain  the  latency  cost  lower  bound.  However,  by  using  the  two- 
step  SBR  approach  described  in  Section  7.1.6,  we  can  minimize  both  words  and  messages. 
In  the  two-step  approach,  the  first  step  requires  an  efficient  parallel  tall-skinny  QR  factor¬ 
ization,  like  TSQR.  For  the  second  step,  the  use  of  the  parallel  algorithm  given  in  Chapter 
10  and  [30]  ensures  the  overall  algorithm  achieves  the  lower  bounds. 

8.1.7  Nonsymmetric  Eigendecomposition 

For  the  nonsymmetric  eigenproblem,  ScaLAPACK’s  routine  (pxgeev)  minimizes  neither 
words  nor  messages.  As  in  the  sequential  case,  it  is  an  open  problem  to  minimize  communi¬ 
cation  using  the  standard  approach  of  reduction  to  Hessenberg  form  followed  by  Hessenberg 
QR  iteration.  Ongoing  algorithmic  and  implementation  development  on  the  ScaLAPACK 
code  has  been  improving  the  communication  costs  and  speed  of  convergence;  see  [77]  for 
details. 

For  the  purposes  of  minimizing  communication,  we  consider  a  different  approach  based 
on  spectral  divide-and-conquer.  As  in  the  sequential  case,  by  using  the  method  of  [13,  17, 
58],  we  can  minimize  both  words  and  messages  with  the  use  of  optimal  parallel  QR  decom¬ 
position  and  matrix  multiplication  subroutines  (see  [17]  for  the  communication  analysis). 
Also,  computing  the  eigenvectors  from  Schur  form  requires  an  optimal  algorithm  which  is 
presented in [17, Section 5].

8.2 Classical Linear Algebra (with Extra Memory)

The lower bounds proved in Chapter 4 depend on the number of flops performed and the
size of the local memory. In Section 8.1 we assume the local memory to be fixed at $\Theta(n^2/P)$;
however, the actual amount of available local memory is a machine parameter rather than
dependent on the problem size. Furthermore, the communication lower bound is proportional
to the inverse of the square root of the local memory size, so increasing the local memory
size (potentially) decreases the communication required of a computation. Thus, the lower
bounds suggest that by using more local memory, we can decrease the amount of communication
performed. Note that the existence of memory-independent lower bounds, given in Section
6.2, provides a limit on how much this tradeoff can be exploited. In this section, we discuss
classical algorithms that are able to take advantage of this opportunity.

8.2.1  Matrix  Multiplication 

The  first  algorithms  to  reduce  the  communication  cost  of  parallel  matrix  multiplication  by 
using  extra  memory  were  developed  by  Aggarwal,  Chandra,  and  Snir  (in  the  LPRAM  model) 
[4]  and  Berntsen  (on  a  hypercube  network)  [35].  Aside  from  using  extra  local  memory,  the 
main  innovation  in  these  algorithms  is  to  divide  the  work  for  computing  single  entries  in  the 
output  matrix  among  multiple  processors;  in  the  case  of  Cannon’s  and  other  2D  algorithms,  a 
single  processor  computes  all  of  the  scalar  multiplications  and  additions  for  any  given  output 
entry. See [16, Figure 2] for a visual classification of 1D, 2D, and 3D matrix multiplication
algorithms  based  on  the  assignment  of  work  to  processors  (note  that  the  classification  is 
applied  to  sparse  algorithms  in  that  paper).  For  a  more  complete  summary  of  3D  dense 
matrix  multiplication  algorithms,  see  [95,  Section  5]. 

The extra memory required for 3D algorithms is $O(n^2/P^{2/3})$, or $O(P^{1/3})$ times the
minimal amount of memory required to store the input and output matrices. The communication
savings, compared to 2D algorithms, is a factor of $O(P^{1/6})$ words and $O(P^{1/2})$ messages.
However, these algorithms provide only a binary alternative to 2D algorithms: only if enough
memory is available can 3D algorithms be employed. McColl and Tiskin [112] showed how
to navigate the tradeoff continuously (in the BSPRAM model): for example, given local
memory of size $O(n^2/P^{2\alpha+2/3})$, their algorithm achieves bandwidth cost of $O(n^2/P^{2/3-\alpha})$ for
$0 \le \alpha \le 1/6$. Later, Solomonik and Demmel [137] developed and implemented a practical
version of the algorithm which generalizes the 2D SUMMA algorithm. Because their approach
fills the gap between 2D and 3D algorithms, the authors coined “2.5D” to describe
their algorithm.
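
As a consistency check on these exponents (a worked sketch, not part of the cited analyses), one can substitute a local memory size $M = n^2/P^{2\alpha+2/3}$ into a bandwidth cost of the form $n^3/(P\sqrt{M})$, which is the inverse-square-root dependence on memory noted at the start of this section. This reproduces the cost $n^2/P^{2/3-\alpha}$ and recovers the 3D and 2D algorithms at the two endpoints:

```latex
\begin{align*}
M = \frac{n^2}{P^{2\alpha+2/3}}
\;\Longrightarrow\;
\frac{n^3}{P\sqrt{M}}
  &= \frac{n^3}{P}\cdot\frac{P^{\alpha+1/3}}{n}
   = \frac{n^2}{P^{2/3-\alpha}}, \\
\alpha = 0:\quad
  M &= \frac{n^2}{P^{2/3}}, \qquad \text{words} = \frac{n^2}{P^{2/3}} \quad \text{(3D)}, \\
\alpha = \tfrac{1}{6}:\quad
  M &= \frac{n^2}{P}, \qquad\;\;\, \text{words} = \frac{n^2}{P^{1/2}} \quad \text{(2D)}.
\end{align*}
```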

Both  of  the  algorithms  exhibiting  the  continuous  tradeoff  are  iterative  algorithms.  A 
recursive  algorithm,  which  is  a  simple  extension  of  the  parallel  Strassen  algorithm  given  in 
Chapter  11  and  [20],  also  achieves  the  same  asymptotic  communication  costs  (see  also  [105] 
for  more  details).  The  recursive  algorithm  is  generalized  to  rectangular  matrices  in  [60,  106], 
and  the  2.5D  algorithm  [137]  is  generalized  to  rectangular  matrices  in  [110],  though  it  is  not 
communication  optimal  in  all  cases. 
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8.2.2  Other  Linear  Algebra  Computations 

There are only a few known algorithms for the rest of linear algebra that are able to navigate
the memory-communication tradeoff, but none as successfully as matrix multiplication. In
particular, in the case of matrix multiplication, both bandwidth and latency costs can be reduced
with the use of extra memory. In the case of all other algorithms for computations that
involve more dependencies than matrix multiplication, there exists a second tradeoff between
bandwidth and latency costs. That is, decreasing the bandwidth cost below $\Theta(n^2/P^{1/2})$
increases the latency cost above $O(\sqrt{P})$ (or vice versa). Therefore, the bandwidth cost and
latency cost lower bounds are not simultaneously achieved for $M \gg \Omega(n^2/P)$. It remains
an open question whether this is a necessary tradeoff.

Tiskin  [140]  presents  a  recursive  algorithm  in  the  BSP  model  for  LU  decomposition  with¬ 
out  pivoting  that  exhibits  the  same  tradeoff  between  communication  and  synchronization  in 
that  model.  This  algorithm  can  be  applied  to  symmetric  positive-definite  matrices,  though 
it  uses  explicit  triangular  matrix  inversion  and  multiplication  (ignoring  stability  issues)  and 
also  ignores  symmetry.  Lipshitz  [106]  provides  a  similar  algorithm  for  Cholesky  decompo¬ 
sition,  along  with  a  recursive  algorithm  for  triangular  solve,  that  has  the  same  asymptotic 
communication costs but exploits symmetry and maintains the stability of the minimal-memory
algorithm. These Cholesky algorithms achieve a bandwidth cost of $O(n^2/P^{\alpha})$ and
latency cost of $O(P^{\alpha})$ for $1/2 \le \alpha \le 2/3$.

In a later paper, Tiskin [141] incorporates pairwise pivoting into a new algorithm with the
same  asymptotic  costs  as  in  [140].  While  pairwise  pivoting  is  not  generally  considered  stable 
enough  in  practice,  the  approach  is  generic  enough  to  apply  to  QR  decomposition  based  on 
Givens  rotations.  However,  the  algorithm  seems  to  be  of  only  theoretical  value:  for  example, 
constants  are  generally  ignored  in  the  analysis,  even  for  computational  costs. 

Solomonik and Demmel [137] devise a stable LU factorization algorithm that uses tournament
pivoting and achieves the same asymptotic costs as the algorithm in [140]; they demonstrate
competitive performance compared to minimal-memory LU algorithms. This work
(both algorithm and implementation) is extended to the symmetric positive-definite case
in [72]. We leave the development of practical algorithms for QR and symmetric-indefinite
factorizations (as well as eigenvalue and singular value decompositions) to future work.

8.3  Fast  Linear  Algebra 

Compared  to  classical  linear  algebra,  much  less  work  has  been  done  on  the  parallelization  of 
fast  linear  algebra  algorithms.  Because  there  is  not  as  rich  a  history  of  minimal-memory  fast 
algorithms,  we  do  not  differentiate  between  minimal-memory  and  extra-memory  algorithms 
in this section. Note that the fast algorithms that do exist are analogous to the classical
extra-memory algorithms of Section 8.2: they can be executed with $M = O(n^2/P)$ if necessary but
can also exploit extra memory to reduce communication costs if possible. While Strassen’s
matrix multiplication has been efficiently parallelized, both in theory and in practice, there
are only a few theoretical results for other fast matrix multiplication algorithms and for other
linear algebra computations.

McColl and Tiskin [112] present a parallelization of any Strassen-like matrix multiplication
algorithm (see Definition 6.1) in the BSPRAM model that achieves a bandwidth cost
of $O(n^2/P^{2/\omega_0 - \alpha(\omega_0 - 2)})$ words using local memory of size $O(n^2/P^{2/\omega_0 + 2\alpha})$, where $\omega_0$ is the
exponent of the computational cost of the algorithm and $0 \le \alpha \le 1/2 - 1/\omega_0$. This algorithm
is communication optimal for any local memory size; in particular, choosing maximum
$\alpha$ achieves a minimal-memory algorithm, and choosing minimum $\alpha$ achieves the
memory-independent lower bound (given in Theorem 6.8 for Strassen’s algorithm). Chapter 11 (see
also  [20])  presents  a  more  practical  version  of  the  algorithm,  with  communication  analysis 
in  the  distributed-memory  model,  described  in  Section  2.2.2,  as  well  as  an  implementation 
with  performance  results.  For  more  detailed  implementation  description  and  performance 
results,  see  [105,  106]. 
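
The two endpoints of this tradeoff can be checked directly. The following is a worked sketch, reading the costs as bandwidth $n^2/P^{2/\omega_0 - \alpha(\omega_0-2)}$ with local memory $n^2/P^{2/\omega_0+2\alpha}$ for $0 \le \alpha \le 1/2 - 1/\omega_0$; the constants hidden by the $O(\cdot)$ notation are ignored.

```latex
\begin{align*}
\alpha = 0:\quad
  & M = \frac{n^2}{P^{2/\omega_0}}, \qquad
    \text{words} = \frac{n^2}{P^{2/\omega_0}}, \\
\alpha = \tfrac{1}{2} - \tfrac{1}{\omega_0}:\quad
  & M = \frac{n^2}{P^{2/\omega_0 + 1 - 2/\omega_0}} = \frac{n^2}{P}, \\
  & \tfrac{2}{\omega_0} - \left(\tfrac{1}{2} - \tfrac{1}{\omega_0}\right)(\omega_0 - 2)
    = 2 - \tfrac{\omega_0}{2}
  \;\Longrightarrow\;
    \text{words} = \frac{n^2}{P^{2 - \omega_0/2}}.
\end{align*}
```

Setting $\omega_0 = 3$ gives the range $0 \le \alpha \le 1/6$ and the exponents $2/3 - \alpha$ of the classical tradeoff in Section 8.2.1, with $n^2/P^{1/2}$ words at minimal memory.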

We  note  that  the  recursive  algorithms  in  Section  8.2.2  that  use  square  matrix  multiplica¬ 
tion  as  a  subroutine  can  benefit  from  a  fast  matrix  multiplication  algorithm.  In  particular, 
the  triangular  solve  and  Cholesky  decomposition  algorithms  of  [106,  Section  5]  and  the  algo¬ 
rithms  of  [141]  attain  the  same  computational  costs  as  the  matrix  multiplication  algorithm 
used  and  similarly  navigate  the  communication-memory  tradeoff.  However,  these  algorithms 
have  only  been  theoretically  analyzed — no  implementations  exist  yet.  We  leave  the  imple¬ 
mentation  of  these  known  algorithms  and  the  development  of  new  algorithms  for  the  rest  of 
linear  algebra  to  future  work. 

8.4  Conclusions  and  Future  Work 

As  in  the  sequential  case,  there  are  many  constant-factor  improvements  possible  for  the 
algorithms  discussed  in  Section  8.1  and  presented  in  Table  8.1.  In  particular,  many  imple¬ 
mentations  are  in  progress  for  demonstrating  performance  benefits  of  the  new  algorithms  for 
symmetric-indefinite  factorizations  and  computing  the  symmetric  eigendecomposition  and 
SVD. 

Furthermore,  the  algorithms  presented  in  Table  8.1  assume  limitations  on  local  memory 
that  are  often  not  necessary,  especially  in  strong-scaling  scenarios.  Developing  and  opti¬ 
mizing  extra-memory  algorithms  is  an  important  area  of  research  for  developing  scalable 
algorithms.  As  mentioned  in  Section  8.2.2,  many  open  algorithmic  problems  remain. 

Finally, the field of using fast parallel algorithms is wide open. While Strassen’s algorithm
has been shown to be effective in practice, making it robust for library deployment, applying
it to other linear algebra computations, and discovering and developing even faster matrix
multiplication algorithms are all areas of future work.

Chapter 9

Communication-Avoiding Symmetric-Indefinite Factorization


The  focus  of  this  chapter  is  a  symmetric-indefinite  factorization  algorithm  that  minimizes 
communication  costs.  The  main  contributions  are 

•  a  new  algorithm  that  is  asymptotically  more  communication  efficient  than  alternative 
algorithms, 

•  a  proof  of  its  backward  stability  (subject  to  a  growth  factor), 

•  a  proof  of  its  communication  optimality  in  the  sequential  model,  and 

•  numerical  experiments  measuring  stability  of  the  algorithm  with  both  randomly  gen¬ 
erated  matrices  and  matrices  arising  in  real  applications. 

The algorithm is a block variant of Aasen’s triangular tridiagonalization algorithm [1]. We
designed the algorithm so that it can be implemented by a sequence of operations, each
involving a constant number of $b \times b$ submatrices (blocks), where $b$ is a tunable parameter.
Most of these block operations perform $\Theta(b^3)$ arithmetic operations, which implies that the
computation-to-communication ratio of the algorithm is $O(b)$. Furthermore, we use a
block-contiguous data layout (so that blocks are always contiguous in slow memory; see Section
2.3.1), implying that each block operation requires only $O(1)$ messages and $O(b^2)$ words.

Matrix  algorithms  with  such  a  structure  usually  perform  well  when  implemented  on 
sequential  or  shared-memory  parallel  computers.  They  can  usually  be  adapted  to  distributed- 
memory  parallel  computers,  but  these  adaptations  are  often  intricate  and  far  from  trivial. 
The  focus  here  is  on  the  sequential  block  algorithm  and  its  memory-hierarchy  performance. 
See  [15]  for  a  description  of  a  shared-memory  parallel  implementation  of  the  algorithm  and 
its  performance.  We  do  not  discuss  distributed-memory  parallelization  in  this  chapter. 

Aasen’s algorithm [1] factors

$$ A = P^T L T L^T P, $$

where $P$ is a permutation matrix selected for numerical stability, $L$ is lower triangular (with
ones on the diagonal and $|L_{ij}| \le 1$), and $T$ is symmetric and tridiagonal. The algorithm
performs $n^3/3 + o(n^3)$ arithmetic operations; it improves upon an earlier algorithm by Parlett
and Reid that computes the same factorization in $2n^3/3 + o(n^3)$ operations [119]. Neither
algorithm is used extensively; a few years later Kaufman and Bunch discovered a similar
factorization that proved to be more popular, one in which the tridiagonal $T$ is replaced by
a matrix that is block diagonal with $2 \times 2$ and $1 \times 1$ blocks [47].

Like other early factorizations, the algorithms of Aasen and of Parlett and Reid are
not communication efficient even for very simple memory hierarchies. If $M < n^2/8$, both
algorithms transfer $O(n^3)$ words between fast and slow memory, which is very inefficient. An
implementation of the Bunch-Kaufman factorization that transfers only
$O(\min(n^3, \frac{n^2}{M} \cdot n^2)) = O(n^4/M)$ words was later discovered,¹ and this implementation
was included in LAPACK [8]. More recently, Rozložník, Shklarski, and Toledo discovered how to
compute the Parlett-Reid-Aasen factorization with the same communication efficiency [125].

In this chapter, we describe and analyze a stable symmetric factorization algorithm that
is communication avoiding; it requires only $O(n^3/\sqrt{M})$ words moved. In terms of communication,
this is much more efficient than any existing symmetric indefinite factorization.
However, the algorithm produces a $T$ that is banded rather than tridiagonal. To achieve
this communication efficiency, the half bandwidth of $T$ is $\Theta(\sqrt{M})$. We also show that the
resulting $T$ can be factored further in a way that is also communication efficient, and the
resulting factorization allows linear systems of equations to be solved quickly.
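
To give a feel for the gap between these regimes, the leading-order word counts quoted above can be compared numerically. This is only an illustrative sketch: the function name `words_moved_estimates` and its dictionary keys are hypothetical, and all constants and lower-order terms are dropped, so the values indicate orders of magnitude rather than measured traffic.

```python
import math

def words_moved_estimates(n, M):
    """Leading-order word counts for an n x n matrix and a fast memory of
    M words, with constants and lower-order terms dropped."""
    return {
        "aasen_or_parlett_reid": n**3,                  # element-by-element algorithms
        "blocked_bunch_kaufman": min(n**3, n**4 / M),   # panel-based LAPACK-style variant
        "block_aasen": n**3 / math.sqrt(M),             # algorithm of this chapter
    }
```

For example, with $n = 10^4$ and $M = 10^6$ the three estimates are $10^{12}$, $10^{10}$, and $10^9$ words, respectively.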

Our  algorithm  is  fundamentally  a  block  version  of  Aasen’s  algorithm,  and  we  will  refer 
to  it  as  the  block-Aasen  algorithm.  While  the  methodology  of  producing  block  matrix 
algorithms  from  element-by-element  algorithms  is  well  understood,  applying  it  to  this  case 
proved  to  be  challenging.  The  first  block-Aasen  algorithm  that  we  designed  was  highly 
unstable.  In  Aasen’s  original  algorithm,  diagonal  elements  of  T  are  computed  by  solving  a 
scalar  equation.  In  the  block  version,  this  scalar  equation  transforms  into  a  linear  system  of 
equations  whose  solution  is  a  diagonal  block  of  T,  which  is  symmetric.  But  the  system  itself 
is  unsymmetric  and  the  symmetry  of  the  solution  is  implicit.  When  the  system  is  solved  in 
floating point arithmetic, the computed block of T can have a non-negligible skew-symmetric
component  in  addition  to  its  symmetric  part,  and  this  excites  an  instability.  To  address  this 
difficulty,  we  designed  an  algorithm  that  produces  a  symmetric  T  even  in  floating  point. 

The  rest  of  the  chapter  is  organized  as  follows.  We  present  the  block-Aasen  algorithm  in 
Section  9.1.  Section  9.2  analyzes  the  stability  of  the  algorithm,  and  Section  9.3  its  computa¬ 
tion  and  communication  complexity.  Numerical  experiments  presented  in  Section  9.4  provide 
additional  insights  into  the  behavior  of  the  algorithm.  Section  9.5  presents  our  conclusions 
from  this  work. 

¹ The $O(n^4/M)$ bound is attained when $M > n$. In this regime, the algorithm factors panels of roughly
$M/n$ columns. Updating a trailing submatrix of dimension $\Theta(n)$ after the factorization of $\Theta(n/(M/n))$ such
panels transfers $O(n^4/M)$ words. When $M < n$, the algorithm transfers $O(n^3)$ words; in this regime the fast
memory has no significant beneficial effect.

All of the results in this chapter appear in [14], written with coauthors Dulceneia Becker,
James  Demmel,  Jack  Dongarra,  Alex  Druinsky,  Inon  Peled,  Oded  Schwartz,  Sivan  Toledo, 
and  Ichitaro  Yamazaki.  The  companion  paper  [15],  written  with  the  same  coauthors,  de¬ 
scribes  a  shared-memory  parallel  implementation  of  the  algorithm  and  presents  its  perfor¬ 
mance  compared  with  several  alternatives;  we  do  not  present  that  data  here.  This  work 
received  a  Best  Paper  award  at  the  2013  International  Parallel  and  Distributed  Processing 
Symposium  (IPDPS). 

9.1 Block-Aasen Algorithm

To keep the notation simple, we initially ignore pivoting in the description of the algorithm.
The algorithm factors the $n \times n$ matrix $A$ into

$$ A = LTL^T, $$

where $L$ is unit lower triangular and $T$ is symmetric and banded with half bandwidth $b$ (i.e.,
$T_{ij} = 0$ if $|i - j| > b$). The algorithm processes the matrices in aligned blocks of size $b \times b$
(except for the trailing blocks, which might be smaller). The algorithm is a block version of
Aasen’s algorithm, so we view $T$ as a block tridiagonal matrix with triangular blocks in the
positions immediately adjacent to the main diagonal.

To describe the algorithm we must specify three auxiliary matrices. The first is a block
upper-triangular matrix $R$ with symmetric diagonal blocks that we require to satisfy

$$ R^T + R = T. $$

The superdiagonal blocks in $R$ are the same as the corresponding blocks in $T$, the diagonal
blocks of $R$ are scaled copies of those of $T$ (with scaling 1/2), and the subdiagonal blocks
in $R$ are zero (unlike in $T$, which is symmetric). The other two matrices are denoted by $W$
and $H$ and are required to satisfy

$$ W = RL^T, \qquad H = TL^T. $$

Aasen’s  original  algorithm  also  computes  H  (forming  it  was  the  key  step  that  allowed  Aasen 
to  eliminate  half  the  arithmetic  operations  from  Parlett  and  Reid’s  algorithm),  but  it  does 
not  compute  W. 
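
For concreteness, the relationship between $R$ and $T$ can be written out for the smallest nontrivial case of two block rows and columns ($N = 2$, a worked illustration rather than part of the original presentation). Since $T_{11}$ and $T_{22}$ are symmetric and $T_{12} = (T_{21})^T$:

```latex
\[
T = \begin{bmatrix} T_{11} & (T_{21})^T \\ T_{21} & T_{22} \end{bmatrix},
\qquad
R = \begin{bmatrix} \tfrac{1}{2} T_{11} & (T_{21})^T \\ 0 & \tfrac{1}{2} T_{22} \end{bmatrix},
\qquad
R + R^T =
\begin{bmatrix} T_{11} & (T_{21})^T \\ T_{21} & T_{22} \end{bmatrix} = T .
\]
```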

We present the algorithm in the form of block-matrix equations, each of which defines one
or two sets of blocks in these matrices. The blocks that are computed from each equation
are underlined, following the notation of [85, Section 11.2]. We use capital $I$ and capital $J$ to
denote block indices, and we denote the block dimension of all the matrices by $N = \lceil n/b \rceil$.
We denote blocks of matrices using indexed notation with block indices. For example, the
submatrix that is specified by $A_{1+(I-1)b:Ib,\; 1+(J-1)b:Jb}$ in scalar-index colon notation is denoted
$A_{I,J}$.

Figure 9.1: An illustration of computing superdiagonal blocks of W via matrix multiplication
in Equation (9.1). Here $N = 6$ and $J = 4$. The blocks that participate in the equation are
enclosed in thick rectangles, and the blocks that are computed using this equation are crossed.
The same notation is used in other diagrams in this section.


The initialization step of the algorithm assigns

$$ L_{1:N,1} = (\text{identity matrix})_{1:N,1}. $$

That is, the first $b$ columns of $L$ have ones on the diagonal and zeros everywhere else. After
this initialization, the algorithm computes a block column of each of the matrices in every
step. Step $J$ computes block column $J+1$ of $L$ and block columns $J$ of $T$, $H$, and $W$ (diagonal
blocks of $W$ are never needed so they are not computed) according to the formulas:

$$ \underline{W_{1:J-1,J}} = R_{1:J-1,1:J}\,(L_{J,1:J})^T \tag{9.1} $$

$$ A_{J,J} = L_{J,1:J-1} W_{1:J-1,J} + (W_{1:J-1,J})^T (L_{J,1:J-1})^T + L_{J,J}\,\underline{T_{J,J}}\,(L_{J,J})^T \tag{9.2} $$

$$ \underline{H_{1:J,J}} = T_{1:J,1:J}\,(L_{J,1:J})^T \tag{9.3} $$

$$ A_{J+1:N,J} = L_{J+1:N,1:J} H_{1:J,J} + \underline{L_{J+1:N,J+1}}\;\underline{H_{J+1,J}} \tag{9.4} $$

$$ H_{J+1,J} = \underline{T_{J+1,J}}\,(L_{J,J})^T \tag{9.5} $$
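
The steps (9.1)-(9.5) can be traced in a small dense NumPy sketch. It is an illustration under several assumptions and not the implementation described in this chapter: the names `block_aasen_nopivot`, `lu_nopivot`, and `block_R` are hypothetical, $b$ is assumed to divide $n$, there is no pivoting (so the panel factorization in (9.4) can break down or be unstable), all matrices are stored as full $n \times n$ arrays with no attention to communication, and the two-sided solve in (9.2) is replaced by two ordinary triangular solves followed by explicit symmetrization instead of a solver that is symmetric by construction.

```python
import numpy as np

def lu_nopivot(M):
    """Unpivoted LU of a tall m x b panel: M = Lp @ U, Lp unit lower trapezoidal."""
    M = np.array(M, dtype=float)
    m, b = M.shape
    for k in range(b):
        M[k + 1:, k] /= M[k, k]
        M[k + 1:, k + 1:] -= np.outer(M[k + 1:, k], M[k, k + 1:])
    Lp = np.tril(M, -1)
    Lp[np.arange(b), np.arange(b)] = 1.0
    return Lp, np.triu(M[:b, :])

def block_R(T, b):
    """R with R + R^T = T: halved diagonal blocks, superdiagonal blocks of T."""
    R = np.zeros_like(T)
    for s in range(0, T.shape[0], b):
        R[s:s + b, s:s + b] = 0.5 * T[s:s + b, s:s + b]
        R[s:s + b, s + b:s + 2 * b] = T[s:s + b, s + b:s + 2 * b]
    return R

def block_aasen_nopivot(A, b):
    """Return (L, T) with A = L @ T @ L.T for symmetric A (no pivoting)."""
    n = A.shape[0]
    assert n % b == 0, "sketch assumes b divides n"
    L = np.zeros((n, n))
    L[:b, :b] = np.eye(b)                      # initialization of block column 1
    T = np.zeros((n, n))
    for lo in range(0, n, b):
        hi = lo + b
        J = slice(lo, hi)
        LJJ = L[J, J]
        W = block_R(T, b)[:lo, :hi] @ L[J, :hi].T           # (9.1)
        S = L[J, :lo] @ W
        B = A[J, J] - (S + S.T)                             # symmetric right-hand side
        X = np.linalg.solve(LJJ, np.linalg.solve(LJJ, B).T).T
        T[J, J] = 0.5 * (X + X.T)                           # (9.2), symmetrized stand-in
        if hi == n:
            break
        H = T[:hi, :hi] @ L[J, :hi].T                       # (9.3)
        Lp, U = lu_nopivot(A[hi:, J] - L[hi:, :hi] @ H)     # (9.4)
        L[hi:, hi:hi + b] = Lp
        T[hi:hi + b, J] = np.linalg.solve(LJJ, U.T).T       # (9.5)
        T[J, hi:hi + b] = T[hi:hi + b, J].T
    return L, T
```

Rebuilding $R$ from $T$ in every step is wasteful and is done here only to keep the correspondence with the equations visible.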


9.1.1  Correctness 

We now show that the algorithm is correct. Verifying that the blocks that are computed
in each equation depend only on blocks that are already known is trivial. Therefore, we
focus on showing that $A = LTL^T$ whenever $L$ and $T$ are computed in exact arithmetic. The
analysis also constitutes a more detailed presentation of the algorithm.

Equation (9.1) computes a block column of $W$, except for the diagonal block, by multiplying
two submatrices, using the equation $W = RL^T$, as shown in Figure 9.1. This guarantees
that $W_{I,J} = (RL^T)_{I,J}$ for all $I < J$. The diagonal blocks of $W$ are not computed and we
define for convenience $W_{J,J} = (RL^T)_{J,J}$ for all $J$. Because all other blocks of $W$ and $RL^T$
are zero, $W = RL^T$.

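Equation (9.2) computes a diagonal block of $T$ by solving a two-sided triangular linear
system. This linear system can be solved by one of the existing solvers which we describe
below. The right-hand side matrix in this system,

$$ A_{J,J} - L_{J,1:J-1} W_{1:J-1,J} - (W_{1:J-1,J})^T (L_{J,1:J-1})^T, $$

must be computed symmetrically; this may be done using the BLAS routine syr2k [45].
The equation guarantees that

$$ A_{J,J} = L_{J,1:J-1} W_{1:J-1,J} + (L_{J,1:J-1} W_{1:J-1,J})^T + L_{J,J} T_{J,J} (L_{J,J})^T. $$

By noting that

$$ L_{J,J} T_{J,J} (L_{J,J})^T = L_{J,J} \left( R_{J,J} + (R_{J,J})^T \right) (L_{J,J})^T $$
$$ \phantom{L_{J,J} T_{J,J} (L_{J,J})^T} = L_{J,J} R_{J,J} (L_{J,J})^T + \left( L_{J,J} R_{J,J} (L_{J,J})^T \right)^T $$

and that the diagonal blocks of $W = RL^T$ are $W_{J,J} = R_{J,J} (L_{J,J})^T$, as shown in Figure 9.2,
we can transform (9.2) into

$$ A_{J,J} = L_{J,1:J-1} W_{1:J-1,J} + (L_{J,1:J-1} W_{1:J-1,J})^T + L_{J,J} W_{J,J} + (L_{J,J} W_{J,J})^T $$
$$ \phantom{A_{J,J}} = L_{J,1:J} W_{1:J,J} + (L_{J,1:J} W_{1:J,J})^T $$
$$ \phantom{A_{J,J}} = \left( LW + (LW)^T \right)_{J,J}. $$

Substituting $W = RL^T$ we obtain

$$ A_{J,J} = \left( LRL^T + (LRL^T)^T \right)_{J,J} = \left( L(R + R^T)L^T \right)_{J,J} = (LTL^T)_{J,J}. $$

Figure 9.2: An expression for the diagonal blocks of W.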


Equation (9.3) computes a block column of $H$, except for the subdiagonal block, by
multiplying matrices, as shown in Figure 9.4. Equation (9.4) multiplies blocks of $L$ and $H$,
subtracts the product from a block of $A$, and factors the difference using an LU factorization.
Equation (9.5) solves a triangular linear system with a triangular right-hand side to compute
a subdiagonal block of $T$.

Figure 9.3: Computing a diagonal block of T in Equation (9.2) by updating the corresponding
block of A and solving a two-sided triangular system. The letters below each matrix
describe only the matrices involved in the expression for $A_{J,J}$; they do not constitute a
matrix equation.

Figure 9.5: Computing a block column of L and a subdiagonal block of H in Equation (9.4)
via the LU factorization of an updated submatrix of A.

Figure 9.6: Computing a block of T by solving a triangular system in Equation (9.5).

Equations (9.3) and (9.5) guarantee that $H_{I,J} = (TL^T)_{I,J}$ for all $I \le J$ and for all
$I = J + 1$, respectively, and because all other blocks of $H$ and $TL^T$ are zero, $H = TL^T$.
Equation (9.4) ensures that $A_{I,J} = (LH)_{I,J}$ for all $I > J$, and substituting $H = TL^T$ shows
that $A_{I,J} = (LTL^T)_{I,J}$ for all $I > J$. Because both $A$ and $LTL^T$ are symmetric, this holds
for all $I < J$ as well, and thus $A = LTL^T$.

9.1.2  Solving  Two-Sided  Triangular  Linear  Systems 

We  now  describe  the  procedure  that  solves  the  two-sided  triangular  linear  system  in  Equa¬ 
tion  (9.2).  The  method  is  not  new;  it  is  used  to  reduce  symmetric  generalized  eigenproblems 
to  standard  eigenproblems  and  is  available  in  LAPACK  and  ScaLAPACK  under  the  name 
SYGST  [44,  134].  Even  though  we  are  focused  on  SYGST  in  this  paper,  other  solvers  that 
produce  a  symmetric  solution  would  also  be  suitable  for  the  task.  Examples  of  such  solvers 
are  the  subroutine  REDUC  in  EISPACK  [111,  135]  and  the  algorithms  implemented  in  the 
Elemental library [120, 121]. Because we apply the solver to block-sized problems, its flop
and communication costs do not have a substantial impact on the overall costs of the algorithm.
The stability of the solver is important, but as long as it satisfies a bound similar to
the  one  we  prove  for  SYGST  in  Section  9.2,  the  impact  on  the  overall  block-Aasen  algorithm 
is  limited  to  the  size  of  the  constant  in  the  backward  stability  bound. 

To the best of our knowledge the stability of SYGST has not been previously analyzed.
In order to analyze the algorithm we will now describe the relevant details of how it works.
The equation that defines $T_{J,J}$ is of the form $LXL^T = B$ with a symmetric right-hand side
$B$.² A trivial way to solve such systems is to use a conventional triangular solver twice: first
solve for $L^{-1}B$ and then solve for $X = (L^{-1}B)L^{-T}$. This method produces
a solution $X$ that is not exactly symmetric in finite precision, and is thus not suitable for
use in the block-Aasen algorithm. Another approach, which leads to an exactly symmetric
$X$ and which performs only half the arithmetic, is an algorithm that we now describe. We
partition all of the matrices that we introduce in this section such that they are all $2 \times 2$
block matrices with first diagonal blocks of dimensions $b \times b$ and second diagonal blocks of
dimensions $(n - b) \times (n - b)$. To describe the algorithm we must define an auxiliary matrix
$Y$, which we require to be a block upper triangular matrix with symmetric diagonal blocks
that satisfies

$$ X = Y^T + Y. $$

Such a matrix must have the form

$$ \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{bmatrix} = \begin{bmatrix} 0.5X_{11} & X_{12} \\ 0 & 0.5X_{22} \end{bmatrix}. $$

² This section uses self-contained notation, for simplicity. The matrix that we call $L$ here is a diagonal
block of the lower-triangular factor in the overall block-Aasen factorization. In addition, the auxiliary
matrices $H$ and $W$, the dimension of the problem $n$ and the block size $b$, all of which we define later in this
section, are also distinct.

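We also need two additional auxiliary matrices $H$ and $W$, which we will require to satisfy

$$ H = XL^T, \qquad W = YL^T. $$

The algorithm works by solving for the underlined blocks in the following equations:

$$ B_{11} = L_{11}\,\underline{X_{11}}\,(L_{11})^T \tag{9.6} $$

$$ B_{12} = L_{11}\,\underline{H_{12}} \tag{9.7} $$

$$ H_{12} = 0.5X_{11}(L_{21})^T + \underline{W_{12}} \tag{9.8} $$

$$ W_{12} = 0.5X_{11}(L_{21})^T + \underline{X_{12}}\,(L_{22})^T \tag{9.9} $$

$$ B_{22} = L_{21}W_{12} + (L_{21}W_{12})^T + L_{22}\,\underline{X_{22}}\,(L_{22})^T. \tag{9.10} $$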


The key in this algorithm is to compute $B_{22} - L_{21}W_{12} - (L_{21}W_{12})^T$ in (9.10) symmetrically,
which allows the algorithm to compute $X_{22}$ symmetrically as well. Note that the
block $0.5X_{11}(L_{21})^T$ is computed twice; the algorithm trades off additional computation for
a reduction in workspace requirements.

We derived the equations in (9.6)-(9.10) by considering specific blocks of the equations

$$B = LXL^T, \quad B = LH, \quad B = LW + (LW)^T, \quad H = Y^TL^T + W, \quad W = YL^T.$$

The  derivation  is  described  by  diagrams  in  Figures  9.7-9.11. 
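The five steps (9.6)-(9.10) can be sketched in NumPy as a recursion. This is a minimal illustration, not the implementation used in this work: the function name `two_sided_solve`, the half-and-half split, and the use of `np.linalg.solve` as a stand-in for a dedicated triangular solver are all assumptions made for the example.

```python
import numpy as np

def two_sided_solve(L, B):
    """Solve L X L^T = B for symmetric X (L lower triangular, B symmetric)."""
    n = L.shape[0]
    if n == 1:
        return B / (L * L)                      # scalar base case
    b = n // 2                                  # illustrative split; any 1 <= b <= n-1 works
    L11, L21, L22 = L[:b, :b], L[b:, :b], L[b:, b:]
    B11, B12, B22 = B[:b, :b], B[:b, b:], B[b:, b:]

    X11 = two_sided_solve(L11, B11)             # (9.6): smaller two-sided solve
    H12 = np.linalg.solve(L11, B12)             # (9.7): L11 H12 = B12
    P = 0.5 * (X11 @ L21.T)                     # the shared block 0.5 X11 L21^T
    W12 = H12 - P                               # (9.8)
    X12 = np.linalg.solve(L22, (W12 - P).T).T   # (9.9): X12 L22^T = W12 - P
    M = L21 @ W12
    S = B22 - (M + M.T)                         # symmetric update used in (9.10)
    X22 = two_sided_solve(L22, S)               # (9.10): smaller two-sided solve

    # Only the diagonal and superdiagonal blocks are computed; X21 is X12^T.
    return np.block([[X11, X12], [X12.T, X22]])

rng = np.random.default_rng(0)
n = 7
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
G = rng.standard_normal((n, n))
B = G + G.T
X = two_sided_solve(L, B)
print(np.abs(L @ X @ L.T - B).max())
```

In this sketch the computed X is symmetric entry for entry, because the update in (9.10) is formed as B22 - (M + M^T) with M = L21 W12 and the block X12 is mirrored into the lower triangle instead of being computed a second time.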

We will now verify the correctness of the algorithm, meaning that $LXL^T = B$ whenever
X is computed in exact arithmetic. The algorithm computes the diagonal and superdiagonal
blocks of X and the superdiagonal blocks of H and W. The subdiagonal block of X is not
computed because X is symmetric. The diagonal and subdiagonal blocks of H and W are
also not computed, and thus we are free to define them so that our notation is simplified.
We define the uncomputed blocks of H and W such that the corresponding blocks of the
equations $H = XL^T$ and $W = YL^T$ hold.

We start by verifying that $(LXL^T)_{12} = B_{12}$. Equation (9.9) ensures that $W_{12} = (YL^T)_{12}$
and thus $W = YL^T$, due to the way we defined the uncomputed blocks of W. Equation (9.8)
guarantees that $H_{12} = (Y^TL^T + W)_{12}$. Substituting $W = YL^T$ and noting that $Y^TL^T + YL^T = XL^T$
shows that $H_{12} = (XL^T)_{12}$ and thus $H = XL^T$, again due to our definition


Figure  9.7:  Computing  the  first  diagonal  block  of  X  by  solving  a  smaller  two-sided  triangular 
system  in  Equation  (9.6). 


Figure 9.8: Solving a triangular system to compute the superdiagonal block of H in
Equation (9.7).


Figure  9.9:  Computing  the  superdiagonal  block  of  W  in  Equation  (9.8)  by  updating  the 
corresponding  block  of  H. 


Figure 9.10: Computing the superdiagonal block of Y in Equation (9.9) by updating the
corresponding block of W and solving a triangular system.


Figure 9.11: Computing the second diagonal block of X in Equation (9.10) by updating the
corresponding block of B and solving a smaller two-sided triangular system.


of the uncomputed blocks of H. Finally, Equation (9.7) ensures that $B_{12} = (LH)_{12}$, and
substituting $H = XL^T$ shows that $B_{12} = (LXL^T)_{12}$.

Next we verify that $(LXL^T)_{22} = B_{22}$ by transforming Equation (9.10):

$$\begin{aligned}
B_{22} &= L_{21}W_{12} + (L_{21}W_{12})^T + L_{22}X_{22}(L_{22})^T \\
&= L_{21}W_{12} + (L_{21}W_{12})^T + L_{22}Y_{22}(L_{22})^T + (L_{22}Y_{22}(L_{22})^T)^T \\
&= L_{21}W_{12} + (L_{21}W_{12})^T + L_{22}W_{22} + (L_{22}W_{22})^T \\
&= (LW + (LW)^T)_{22} \\
&= (LYL^T + LY^TL^T)_{22} \\
&= (L(Y + Y^T)L^T)_{22} \\
&= (LXL^T)_{22}.
\end{aligned}$$

Finally, Equation (9.6) explicitly ensures that $(LXL^T)_{11} = B_{11}$ and thus $LXL^T = B$.

We did not specify the dimensions of the blocks; different choices yield different algorithms.
If we choose b = 1, we end up with an algorithm that computes the columns of X
one at a time, in which (9.10) iterates over remaining columns. This version is called SYGS2
in LAPACK and ScaLAPACK. The costs in this partitioning are dominated by the triangular
solve in (9.9) and the symmetric update in (9.10), which require $(n-i)^2$ and $2(n-i)^2$
flops, respectively. Thus, the leading term in flop cost is given by

$$F_1(n) = \sum_{i=1}^{n-1} 3(n-i)^2 = 3\sum_{i=1}^{n-1} i^2 = n^3 + o(n^3).$$

If instead of b = 1 we choose some fixed b > 1, we obtain SYGST, which works by
computing a total of b columns of X at a time. Equation (9.6) here corresponds to a call
to SYGS2 and Equation (9.10) iterates over remaining block columns. As long as $n \gg b$,
the costs are again dominated by the triangular solve in (9.9) and the symmetric update
in (9.10), which require $(n-ib)^2b$ and $2(n-ib)^2b$ flops, respectively (here i iterates over block
columns). Thus, the leading term in the flop cost is given by

$$F_b(n) = \sum_{i=1}^{n/b-1} 3(n-ib)^2b = 3b^3\sum_{i=1}^{n/b-1} i^2 = n^3 + o(n^3).$$
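As a quick sanity check of the two sums above, the following sketch evaluates the leading-order flop count for several block sizes and compares it with $n^3$. The function name `flops_blocked` and the chosen sizes are illustrative assumptions; the sketch also assumes that b divides n.

```python
def flops_blocked(n, b):
    """Leading-order flop count: sum over i = 1, ..., n/b - 1 of 3 (n - i b)^2 b."""
    return sum(3 * (n - i * b) ** 2 * b for i in range(1, n // b))

n = 4096
for b in (1, 16, 64):
    # The ratio to n^3 approaches 1 as n / b grows.
    print(b, flops_blocked(n, b) / n**3)
```

The case b = 1 reproduces $F_1(n)$, and the larger block sizes reproduce $F_b(n)$; the gap to $n^3$ is a lower-order term that shrinks as n/b grows.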


We can also formulate the algorithm recursively, with $X_{11}$ being $\frac{n}{2} \times \frac{n}{2}$. This
recursive version is new and it is communication avoiding and cache oblivious even for large
matrices (we use it only on blocks, so this is not useful for the blocked Aasen algorithm).
Equations (9.6) and (9.10) are recursive calls. The triangular solves in steps (9.7) and (9.9)
and the block multiplications in steps (9.8), (9.9) and (9.10) all contribute to the leading
term in the flop cost. The product $X_{11}(L_{21})^T$ is a SYMM BLAS call which costs $2(n/2)^3$.
The triangular solves are TRSM calls that cost $(n/2)^3$ each, and the product $L_{21}W_{12}$ is a
GEMM call that costs $2(n/2)^3$ (we can subtract the transpose after computing and subtracting
the product). If we store the product $X_{11}(L_{21})^T$ and reuse it in both step (9.8) and step (9.9),
the recurrence is

$$F_R(n) = 2F_R\left(\frac{n}{2}\right) + 6\left(\frac{n}{2}\right)^3,$$

which again yields

$$F_R(n) = \frac{3}{4}\cdot\frac{4}{3}\,n^3 + o(n^3) = n^3 + o(n^3).$$

If we chose to recompute $X_{11}(L_{21})^T$ in order to run the algorithm in place, the flop count
increases but is still $O(n^3)$.
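The recurrence can be unrolled numerically to confirm the leading term. In this sketch the base-case cost $F_R(1) = 1$ is an assumption made only to close the recursion (the text does not fix it), and the name `flops_recursive` is illustrative.

```python
def flops_recursive(n):
    """F_R(n) = 2 F_R(n/2) + 6 (n/2)^3 for n a power of two; F_R(1) = 1 is assumed."""
    if n == 1:
        return 1
    half = n // 2
    # Two recursive calls (9.6), (9.10); SYMM + two TRSM + GEMM cost 6 (n/2)^3 in total.
    return 2 * flops_recursive(half) + 6 * half**3

for k in (4, 8, 12):
    n = 2**k
    print(n, flops_recursive(n) / n**3)
```

With this particular base case the recurrence evaluates to exactly $n^3$ for powers of two; a different constant for $F_R(1)$ changes only a term linear in n.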


9.1.3  Pivoting 

Without pivoting, the algorithm can break down or become unstable, just like the classical
element-wise Aasen algorithm. In the new algorithm, blocks $L_{J+1:N,J+1}$ and $H_{J+1,J}$ are
computed using an LU factorization, and without pivoting, the factorization may fail to
exist or may be unstable. Clearly, we need to pivot in Equation (9.4). It turns out that this
stabilizes the algorithm, as in the element-wise algorithm.

We use row pivoting, meaning that step J factors

$$L_{J+1:N,J+1}H_{J+1,J} = P_J\left(A_{J+1:N,J} - L_{J+1:N,1:J}H_{1:J,J}\right),$$

where $P_J$ is a permutation matrix, determined by partial pivoting, for example. There are
several ways to compute this LU factorization in a way that ensures that the block-Aasen
algorithm is communication avoiding; we list and analyze them in Section 9.3.2.2 below.
Once the LU factorization is computed, we also apply $P_J$ to $L_{J+1:N,1:J}$ and to the trailing
submatrix $A_{J+1:N,J+1:N}$. Applying the permutation to L and to the trailing submatrix is not
trivial to do in a communication-avoiding way, especially since only the upper or lower triangle
of the trailing submatrix is stored. The details are explained below, in Section 9.3.2.3.

9.1.4  Computing  W  and  H 

As we show in Section 9.3, the arithmetic and communication costs of our algorithm are
asymptotically dominated by the computation that corresponds to Equation (9.4). Nevertheless,
if optimizing the computation that corresponds to the other equations can yield any
savings, then pursuing such optimizations would be desirable from a practical standpoint.
It turns out that savings are possible in Equations (9.1) and (9.3). These equations state
that individual blocks of H and W are computed according to the formulas

$$H_{I,J} = T_{I,I-1}(L_{J,I-1})^T + T_{I,I}(L_{J,I})^T + T_{I,I+1}(L_{J,I+1})^T$$

and

$$\begin{aligned} W_{I,J} &= R_{I,I}(L_{J,I})^T + R_{I,I+1}(L_{J,I+1})^T \\ &= 0.5\,T_{I,I}(L_{J,I})^T + T_{I,I+1}(L_{J,I+1})^T \end{aligned}$$

for all $I < J$. (We encourage the reader to review Figures 9.1 and 9.4 for a visualization of
these relations.) The blocks $T_{I,I}(L_{J,I})^T$ and $T_{I,I+1}(L_{J,I+1})^T$ appear in these equations twice
but need to be computed only once. Avoiding the recomputation of these blocks reduces the
number of $b \times b$ matrix products required to compute W and H from five to three (a ratio
of 5:3), thereby making the computation of W essentially free.
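The saving can be made concrete with a short NumPy sketch that forms one block of H and one block of W from three block products instead of five. The helper name `h_and_w_block`, the 0-based block indices, and the dense test matrices are assumptions made for illustration; R is built here as the block upper-triangular matrix with $R + R^T = T$.

```python
import numpy as np

def h_and_w_block(T, L, I, J, b):
    """Return (H_{I,J}, W_{I,J}) for I < J (0-based block indices) using three products."""
    def blk(A, r, c):
        return A[r * b:(r + 1) * b, c * b:(c + 1) * b]
    P_diag = blk(T, I, I) @ blk(L, J, I).T            # T_{I,I} (L_{J,I})^T, used twice
    P_sup = blk(T, I, I + 1) @ blk(L, J, I + 1).T     # T_{I,I+1} (L_{J,I+1})^T, used twice
    H = P_diag + P_sup
    if I > 0:                                         # T_{I,I-1} term exists only for I > 0
        H = H + blk(T, I, I - 1) @ blk(L, J, I - 1).T
    W = 0.5 * P_diag + P_sup
    return H, W

rng = np.random.default_rng(1)
b, N = 3, 4
n = b * N
blocks = np.arange(n) // b
diff = blocks[:, None] - blocks[None, :]
G = rng.standard_normal((n, n))
T = (G + G.T) * (np.abs(diff) <= 1)                   # symmetric block tridiagonal T
L = np.tril(rng.standard_normal((n, n)))
R = T * (diff < 0) + 0.5 * T * (diff == 0)            # block upper triangular, R + R^T = T
H, W = h_and_w_block(T, L, 1, 3, b)
```

Both results agree with the corresponding blocks of the dense products $TL^T$ and $RL^T$, which is what Equations (9.3) and (9.1) prescribe.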

9.1.5  The  Second  Phase  of  the  Algorithm:  Factoring  T 

There are several single-pass algorithms that efficiently factor a banded symmetric matrix.
All of these algorithms process $O(b)$ rows and columns at a time, so if we choose a small
enough $b = O(\sqrt{M})$, the total number of words moved is $O(nb)$, which makes them
communication efficient (because the size of input and output is $O(nb)$).

Algorithms with these properties include the unsymmetric banded LU factorization with
partial pivoting, Kaufman’s retraction algorithm [99], and Irony and Toledo’s snap-back
algorithm [94]. Both retraction and snap-back algorithms preserve symmetry (and matrix
inertia); the LU algorithm destroys the symmetry of the band. All three perform $O(nb^2)$
flops and produce a factorization that is essentially banded with bandwidth $O(b)$. All of
these factorizations can be used to solve linear systems of equations using $O(bn)$ arithmetic
per right-hand side and $O(bn)$ words moved (for up to $O(\sqrt{M})$ right-hand sides).

9.2 Numerical Stability

We analyze the stability of the factorization of $PAP^T$ where P is the permutation matrix
generated by the selection of pivots. We assume in the analysis that the matrix has been
pre-permuted so the algorithm is applied directly to $PAP^T$ (rather than to A) and that it
never  pivots.  The  sequence  of  arithmetic  operations  in  such  a  run  of  the  algorithm  is  identical 
to  that  of  the  pivoting  version  applied  to  A,  except  perhaps  for  the  order  of  summation  in 
inner  products.  Our  analysis  does  not  depend  on  this  ordering  so  our  results  apply  to  the 
pivoting  version.  We  will  use  the  standard  model  of  floating  point  computation  and  several 
well-known  results  regarding  stability  of  fundamental  operations  (see  Section  2.6). 


9.2.1  Stability  of  the  Two-Sided  Triangular  Solver 

Our notation in this section is the same as in Section 9.1.2: we consider solving an $n \times n$
system and partition all of the matrices such that they are $2 \times 2$ block matrices with first
diagonal blocks of dimensions $b \times b$ and second diagonal blocks of dimensions $(n-b) \times (n-b)$,
where $1 \le b \le n-1$. The matrix X and the superdiagonal blocks of H and W represent
the computed floating point matrices. We use Y to denote the exact matrix

$$Y = \begin{bmatrix} 0.5X_{11} & X_{12} \\ 0 & 0.5X_{22} \end{bmatrix}$$

and we define the diagonal and subdiagonal blocks of H and W such that the corresponding
blocks of $H = XL^T$ and $W = YL^T$ hold.


Lemma 9.1. If the two-sided $n \times n$ triangular system $LXL^T = B$ is solved in floating point
arithmetic using any of the partitioned algorithms given in Section 9.1.2, then the computed
X satisfies

$$B = LXL^T + \Delta, \qquad |\Delta| \le \gamma_{3n-1}|L||X||L^T|.$$

Proof. The proof is by strong induction on n. In the base case we are solving a $1 \times 1$ system
and the bound clearly holds. To make the inductive step, we first show how the backward
error $\Delta$ relates to the backward errors of the matrix equations used in the algorithms. We
define the matrices $\Delta^{(i)}$ for $i = 1, 2, \ldots, 5$ such that

$$B = LH + \Delta^{(1)}, \quad H = Y^TL^T + W + \Delta^{(2)}, \quad W = YL^T + \Delta^{(3)}, \quad B = LW + (LW)^T + \Delta^{(4)}$$

and

$$H = XL^T + \Delta^{(5)}.$$

Substituting the third formula into the second one, we obtain

$$H = Y^TL^T + YL^T + \Delta^{(2)} + \Delta^{(3)} = XL^T + \Delta^{(2)} + \Delta^{(3)},$$

and then substituting this result into the first formula yields

$$B = LXL^T + \Delta^{(1)} + L(\Delta^{(2)} + \Delta^{(3)}).$$

Similarly, substituting the third formula into the fourth one yields

$$B = LXL^T + L\Delta^{(3)} + (L\Delta^{(3)})^T + \Delta^{(4)}.$$

Thus we have

$$\Delta^{(5)} = \Delta^{(2)} + \Delta^{(3)} \tag{9.11}$$

and two equations for the backward error of the solution of the system:

$$\Delta = \Delta^{(1)} + L\Delta^{(5)} \tag{9.12}$$
$$\Delta = L\Delta^{(3)} + (L\Delta^{(3)})^T + \Delta^{(4)}. \tag{9.13}$$

We’re now ready to consider the computations involved in the algorithm. Recall from
Section 9.1.2 that we partition all matrices into $2 \times 2$ blocks so that the top left block is $b \times b$
for some $1 \le b \le n-1$. Applying the lemmas of rounding-error analysis (given in Section
2.6) and the inductive hypothesis to equations (9.6)-(9.10) allows us to derive the bounds

$$|\Delta_{11}| \le \gamma_{3b-1}(|L||X||L|^T)_{11} \tag{9.14}$$
$$|\Delta^{(1)}_{12}| \le \gamma_b(|L||H|)_{12} \tag{9.15}$$
$$|\Delta^{(2)}| \le \gamma_b(|Y^T||L^T| + |W|) \tag{9.16}$$
$$|\Delta^{(3)}| \le \gamma_n|Y||L^T| \tag{9.17}$$
$$|\Delta^{(4)}_{22}| \le \gamma_{2b}(|L_{21}||W_{12}| + (|L_{21}||W_{12}|)^T) + \gamma_{2b+3(n-b)-1}|L_{22}||X_{22}||L_{22}|^T. \tag{9.18}$$

For example, Equation (9.14) is the result of applying the inductive hypothesis to the solution
of a $b \times b$ system (the first step in the algorithm), and Equation (9.15) is the result of applying
Lemma 2.16 to the second step of the algorithm, a $b \times b$ triangular solve. We omit the
derivation of Equations (9.16)-(9.18) because they are similar.

Applying (9.16) and (9.17) to (9.11), substituting $W = YL^T + \Delta^{(3)}$, and then bounding
again yields

$$|\Delta^{(5)}| \le \gamma_b|Y^T||L^T| + (\gamma_b + \gamma_n + \gamma_b\gamma_n)|Y||L^T|,$$

and bounding $\gamma_b \le \gamma_{n+b}$ and $\gamma_b + \gamma_n + \gamma_b\gamma_n \le \gamma_{n+b}$ yields

$$|\Delta^{(5)}| \le \gamma_{n+b}|X||L^T|. \tag{9.19}$$

Substituting (9.15) and (9.19) into (9.12), further substituting $H = XL^T + \Delta^{(5)}$, and then
applying (9.19) again yields

$$|\Delta_{12}| \le (\gamma_b + \gamma_{n+b} + \gamma_b\gamma_{n+b})(|L||X||L^T|)_{12} \le \gamma_{n+2b}(|L||X||L^T|)_{12}. \tag{9.20}$$

Next we bound $\Delta_{22}$ using (9.13), starting with the first two terms in that equation. We
defined W such that $\Delta^{(3)}_{22}$ is zero and thus

$$(|L||\Delta^{(3)}| + (|L||\Delta^{(3)}|)^T)_{22} = |L_{21}||\Delta^{(3)}_{12}| + (|L_{21}||\Delta^{(3)}_{12}|)^T.$$

Further applying (9.17) yields

$$\begin{aligned} (|L||\Delta^{(3)}| + (|L||\Delta^{(3)}|)^T)_{22} &\le \gamma_n|L_{21}||Y_{1,:}||L_{2,:}|^T + \gamma_n(|L_{21}||Y_{1,:}||L_{2,:}|^T)^T \\ &= \gamma_n((|L||X||L^T|)_{22} - |L_{22}||X_{22}||L_{22}|^T). \end{aligned} \tag{9.21}$$

Substituting $W = YL^T + \Delta^{(3)}$ in the first two terms of (9.18) and using the same argument
that we used to produce (9.21) yields

$$|L_{21}||W_{12}| + (|L_{21}||W_{12}|)^T \le (1 + \gamma_n)((|L||X||L^T|)_{22} - |L_{22}||X_{22}||L_{22}|^T). \tag{9.22}$$

Substituting (9.22) into (9.18), and then substituting the result together with (9.21) into
(9.13) yields

$$\begin{aligned} |\Delta_{22}| \le{}& (\gamma_n + \gamma_{2b} + \gamma_{2b}\gamma_n)((|L||X||L^T|)_{22} - |L_{22}||X_{22}||L_{22}|^T) \\ &+ \gamma_{2b+3(n-b)-1}|L_{22}||X_{22}||L_{22}|^T. \end{aligned}$$

Because $1 \le b \le n-1$,

$$\gamma_n + \gamma_{2b} + \gamma_{2b}\gamma_n \le \gamma_{n+2b} \le \gamma_{3n-2}$$
$$\gamma_{2b+3(n-b)-1} = \gamma_{3n-b-1} \le \gamma_{3n-2},$$

and thus

$$|\Delta_{22}| \le \gamma_{3n-2}(|L||X||L^T|)_{22}. \tag{9.23}$$

Combining bounds (9.14), (9.20) and (9.23) we see that $|\Delta_{I,J}| \le C_{I,J}(|L||X||L^T|)_{I,J}$,
where

$$C = \begin{bmatrix} \gamma_{3b-1} & \gamma_{n+2b} \\ \gamma_{n+2b} & \gamma_{3n-2} \end{bmatrix},$$

and because $C_{I,J} \le \gamma_{3n-2}$ for all I and J, we conclude that $|\Delta| \le \gamma_{3n-2}(|L||X||L^T|)$. The
exception to this is the case $n = 1$, which requires the larger constant $\gamma_{3n-1}$ in the statement
of the lemma. □
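The componentwise bound of Lemma 9.1 can be observed numerically. The sketch below runs a recursive variant of steps (9.6)-(9.10) in single precision and evaluates the residual and the bound in double precision. It assumes the standard definition $\gamma_k = ku/(1-ku)$ with unit roundoff $u = 2^{-24}$ for IEEE single precision; the function names, the split b = n // 2, and the test matrices are illustrative assumptions.

```python
import numpy as np

def forward_sub(L, C):
    """Solve L Y = C by forward substitution, staying in the precision of the inputs."""
    Y = np.zeros_like(C)
    for i in range(L.shape[0]):
        Y[i] = (C[i] - L[i, :i] @ Y[:i]) / L[i, i]
    return Y

def two_sided_solve(L, B):
    """Recursive form of steps (9.6)-(9.10) with b = n // 2."""
    n = L.shape[0]
    if n == 1:
        return B / (L * L)
    b = n // 2
    L11, L21, L22 = L[:b, :b], L[b:, :b], L[b:, b:]
    X11 = two_sided_solve(L11, B[:b, :b])              # (9.6)
    H12 = forward_sub(L11, B[:b, b:])                  # (9.7)
    P = 0.5 * (X11 @ L21.T)
    W12 = H12 - P                                      # (9.8)
    X12 = forward_sub(L22, (W12 - P).T).T              # (9.9)
    M = L21 @ W12
    X22 = two_sided_solve(L22, B[b:, b:] - (M + M.T))  # (9.10)
    return np.block([[X11, X12], [X12.T, X22]])

def gamma(k, u):
    """gamma_k = k u / (1 - k u), assumed definition (requires k u < 1)."""
    return k * u / (1 - k * u)

rng = np.random.default_rng(0)
n = 32
L = (np.tril(rng.standard_normal((n, n))) + n * np.eye(n)).astype(np.float32)
G = rng.standard_normal((n, n)).astype(np.float32)
B = G + G.T
X = two_sided_solve(L, B)                              # all arithmetic in float32

# Residual and bound are evaluated in double precision.
Ld, Xd, Bd = (A.astype(np.float64) for A in (L, X, B))
u = 2.0 ** -24
Delta = np.abs(Bd - Ld @ Xd @ Ld.T)
bound = gamma(3 * n - 1, u) * (np.abs(Ld) @ np.abs(Xd) @ np.abs(Ld.T))
print((Delta / bound).max())
```

The printed ratio is the largest entry of $|\Delta|$ relative to the bound. Values well below 1 are expected, since the lemma is a worst-case statement.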


9.2.2  Stability  of  the  Block- Aasen  Algorithm 

We now show that the block-Aasen algorithm is backward stable. The analysis relies on
lemmas from Section 2.6 and on Lemma 9.1. We use the symbols L, T, H and W to denote
the corresponding floating point matrices and not their abstract exact equivalents. The
exception to this is the diagonal blocks of W, which are not computed by the algorithm and
which we define for convenience as being exactly

$$W_{J,J} = 0.5\,T_{J,J}(L_{J,J})^T = R_{J,J}(L_{J,J})^T.$$

Similarly, R is also not computed by the algorithm due to the optimization described in
Section 9.1.4. We define R as the block upper-triangular matrix with symmetric diagonal
blocks that satisfies $R^T + R = T$. Its superdiagonal blocks are exactly those of the computed
T and its diagonal blocks are obtained from those of the computed T by scaling them by 0.5.

We  break  the  main  content  of  the  proof  into  two  lemmas,  bounding  the  backward  error 
in  the  off-diagonal  blocks  in  Lemma  9.2  and  bounding  the  backward  error  in  the  diagonal 
blocks  in  Lemma  9.3.  Combining  these  results,  we  obtain  the  backward  stability  of  the 
block-Aasen  algorithm,  stated  in  Theorem  9.4. 

Lemma 9.2. The computed factors satisfy $A = LTL^T + \Delta$, where

$$|\Delta_{I,J}| \le \gamma_{n+2b}(|L||T||L^T|)_{I,J}$$

whenever $I \ne J$.

Proof. Let the matrices $\Delta^{(1)}$ and $\Delta^{(2)}$ be such that

$$A = LH + \Delta^{(1)}, \qquad H = TL^T + \Delta^{(2)}.$$

Substituting the second expression into the first one yields

$$A = LTL^T + L\Delta^{(2)} + \Delta^{(1)},$$

and thus

$$\Delta = \Delta^{(1)} + L\Delta^{(2)}. \tag{9.24}$$

Bounding $\Delta$ requires that we obtain bounds on $\Delta^{(1)}$ and $\Delta^{(2)}$.

Let us bound the subdiagonal blocks of $\Delta^{(1)}$ by considering the computation that corresponds
to Equation (9.4). In that equation we form the matrix $X = A_{J+1:N,J} - L_{J+1:N,1:J}H_{1:J,J}$
and then compute its LU factorization $X = L_{J+1:N,J+1}H_{J+1,J}$. Let $\Gamma^{(1)}$ and $\Gamma^{(2)}$ be such
that

$$A_{J+1:N,J} = L_{J+1:N,1:J}H_{1:J,J} + X + \Gamma^{(1)} \tag{9.25}$$
$$X = L_{J+1:N,J+1}H_{J+1,J} + \Gamma^{(2)}. \tag{9.26}$$

Substituting the second expression into the first one yields

$$\begin{aligned} A_{J+1:N,J} &= L_{J+1:N,1:J}H_{1:J,J} + L_{J+1:N,J+1}H_{J+1,J} + \Gamma^{(1)} + \Gamma^{(2)} \\ &= L_{J+1:N,1:J+1}H_{1:J+1,J} + \Gamma^{(1)} + \Gamma^{(2)}, \end{aligned}$$

and because $H_{J+2:N,J}$ is zero,

$$A_{J+1:N,J} = (LH)_{J+1:N,J} + \Gamma^{(1)} + \Gamma^{(2)},$$

and therefore

$$\Delta^{(1)}_{J+1:N,J} = \Gamma^{(1)} + \Gamma^{(2)}. \tag{9.27}$$

We analyze the accuracy of forming X using Lemma 2.14, which yields the bound

$$|\Gamma^{(1)}| \le \gamma_{Jb}(|L_{J+1:N,1:J}||H_{1:J,J}| + |X|).$$

However, because $L_{2:N,1}$ is zero, the inner dimension of the product $L_{J+1:N,1:J}H_{1:J,J}$ is
effectively $(J-1)b$ instead of $Jb$, and therefore

$$|\Gamma^{(1)}| \le \gamma_{(J-1)b}(|L_{J+1:N,1:J}||H_{1:J,J}| + |X|). \tag{9.28}$$

The accuracy of the LU factorization of X can be analyzed using Lemma 2.17, which yields

$$|\Gamma^{(2)}| \le \gamma_b|L_{J+1:N,J+1}||H_{J+1,J}|. \tag{9.29}$$

Substituting (9.28) and (9.29) into (9.27) yields

$$|\Delta^{(1)}_{J+1:N,J}| \le \gamma_{(J-1)b}(|L_{J+1:N,1:J}||H_{1:J,J}| + |X|) + \gamma_b|L_{J+1:N,J+1}||H_{J+1,J}|,$$

and further substituting (9.26) and using (9.29) again yields

$$|\Delta^{(1)}_{J+1:N,J}| \le \gamma_{(J-1)b}|L_{J+1:N,1:J}||H_{1:J,J}| + (\gamma_{(J-1)b} + \gamma_b + \gamma_{(J-1)b}\gamma_b)|L_{J+1:N,J+1}||H_{J+1,J}|.$$

Bounding the constants in this expression according to

$$\gamma_{(J-1)b} \le \gamma_{Jb}$$
$$\gamma_{(J-1)b} + \gamma_b + \gamma_{(J-1)b}\gamma_b \le \gamma_{Jb},$$

where the second bound is justified by Lemma 2.12, yields

$$\begin{aligned} |\Delta^{(1)}_{J+1:N,J}| &\le \gamma_{Jb}|L_{J+1:N,1:J}||H_{1:J,J}| + \gamma_{Jb}|L_{J+1:N,J+1}||H_{J+1,J}| \\ &= \gamma_{Jb}(|L||H|)_{J+1:N,J} \end{aligned}$$

and therefore

$$|\Delta^{(1)}_{I,J}| \le \gamma_{Jb}(|L||H|)_{I,J} \tag{9.30}$$

for all $I > J$.

We bound the diagonal and superdiagonal blocks of $\Delta^{(2)}$ by considering the computation
that corresponds to Equation (9.3). In that equation we compute blocks of H by forming
the corresponding blocks of $TL^T$, and as we discuss in Section 9.1.4, these blocks are formed
according to the formula

$$H_{I,J} = T_{I,I-1}(L_{J,I-1})^T + T_{I,I}(L_{J,I})^T + T_{I,I+1}(L_{J,I+1})^T.$$

This is equivalent to multiplying the $b \times 3b$ matrix $T_{I,I-1:I+1}$ by the $3b \times b$ matrix $(L_{J,I-1:I+1})^T$,
and the accuracy of this computation is bounded in Lemma 2.13, which guarantees that

$$|\Delta^{(2)}_{I,J}| \le \gamma_{3b}(|T||L^T|)_{I,J}$$

for all $I \le J$. The blocks $\Delta^{(2)}_{J+1,J}$ correspond to Equation (9.5), which solves the triangular
system $H_{J+1,J} = T_{J+1,J}(L_{J,J})^T$. This is analyzed in Lemma 2.16, which guarantees that

$$|\Delta^{(2)}_{J+1,J}| \le \gamma_b(|T||L^T|)_{J+1,J}.$$

All other blocks of $\Delta^{(2)}$ are zero, and thus

$$|\Delta^{(2)}| \le \gamma_{3b}|T||L^T|. \tag{9.31}$$

Substituting (9.30) and (9.31) into (9.24) yields

$$|\Delta_{I,J}| \le \gamma_{Jb}(|L||H|)_{I,J} + \gamma_{3b}(|L||T||L^T|)_{I,J}$$

for all $I > J$. Further substituting $H = TL^T + \Delta^{(2)}$ and using (9.31) once again yields

$$|\Delta_{I,J}| \le (\gamma_{Jb} + \gamma_{3b} + \gamma_{Jb}\gamma_{3b})(|L||T||L^T|)_{I,J} \le \gamma_{Jb+3b}(|L||T||L^T|)_{I,J}.$$

The constant $\gamma_{Jb+3b}$ is maximized when $J = N-1$, which yields the required bound for all
$I > J$. As for $I < J$, the same bound holds because $\Delta$ is the difference of the two symmetric
matrices A and $LTL^T$ and is thus itself symmetric. □

Lemma 9.3. The computed factors satisfy $A = LTL^T + \Delta$, where

$$|\Delta_{J,J}| \le \gamma_{2n-b-1}(|L||T||L^T|)_{J,J}$$

for $J = 1, 2, \ldots, N$.

Proof. Let the matrices $\Delta^{(1)}$ and $\Delta^{(2)}$ be such that

$$A = LW + (LW)^T + \Delta^{(1)}, \qquad W = RL^T + \Delta^{(2)}.$$

Substituting the second expression into the first one yields

$$\begin{aligned} A &= LRL^T + (LRL^T)^T + L\Delta^{(2)} + (L\Delta^{(2)})^T + \Delta^{(1)} \\ &= L(R + R^T)L^T + L\Delta^{(2)} + (L\Delta^{(2)})^T + \Delta^{(1)} \\ &= LTL^T + L\Delta^{(2)} + (L\Delta^{(2)})^T + \Delta^{(1)} \end{aligned}$$

and therefore

$$\Delta = \Delta^{(1)} + L\Delta^{(2)} + (L\Delta^{(2)})^T. \tag{9.32}$$

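Equation (9.2) computes $T_{J,J}$ by forming $X = A_{J,J} - L_{J,1:J-1}W_{1:J-1,J} - (L_{J,1:J-1}W_{1:J-1,J})^T$
and then solving $L_{J,J}T_{J,J}(L_{J,J})^T = X$. Let $\Gamma^{(1)}$ and $\Gamma^{(2)}$ be such that

$$A_{J,J} = L_{J,1:J-1}W_{1:J-1,J} + (L_{J,1:J-1}W_{1:J-1,J})^T + X + \Gamma^{(1)} \tag{9.33}$$
$$X = L_{J,J}T_{J,J}(L_{J,J})^T + \Gamma^{(2)}. \tag{9.34}$$

Substituting (9.34) into (9.33) yields

$$A_{J,J} = L_{J,1:J-1}W_{1:J-1,J} + (L_{J,1:J-1}W_{1:J-1,J})^T + L_{J,J}T_{J,J}(L_{J,J})^T + \Gamma^{(1)} + \Gamma^{(2)}.$$

Rewriting the term $L_{J,J}T_{J,J}(L_{J,J})^T$ according to

$$\begin{aligned} L_{J,J}T_{J,J}(L_{J,J})^T &= L_{J,J}(R_{J,J} + (R_{J,J})^T)(L_{J,J})^T \\ &= L_{J,J}R_{J,J}(L_{J,J})^T + (L_{J,J}R_{J,J}(L_{J,J})^T)^T \\ &= L_{J,J}W_{J,J} + (L_{J,J}W_{J,J})^T \end{aligned}$$

gives

$$\begin{aligned} A_{J,J} &= L_{J,1:J-1}W_{1:J-1,J} + (L_{J,1:J-1}W_{1:J-1,J})^T + L_{J,J}W_{J,J} + (L_{J,J}W_{J,J})^T + \Gamma^{(1)} + \Gamma^{(2)} \\ &= L_{J,1:J}W_{1:J,J} + (L_{J,1:J}W_{1:J,J})^T + \Gamma^{(1)} + \Gamma^{(2)} \\ &= (LW + (LW)^T)_{J,J} + \Gamma^{(1)} + \Gamma^{(2)} \end{aligned}$$

and thus

$$\Delta^{(1)}_{J,J} = \Gamma^{(1)} + \Gamma^{(2)}. \tag{9.35}$$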

The accuracy of forming X and then solving for $T_{J,J}$ can be bounded using Lemmas 2.15
and 9.1, which guarantee that

$$|\Gamma^{(1)}| \le \gamma_{2(J-2)b}(|L_{J,1:J-1}||W_{1:J-1,J}| + (|L_{J,1:J-1}||W_{1:J-1,J}|)^T + |X|)$$
$$|\Gamma^{(2)}| \le \gamma_{3b-1}|L_{J,J}||T_{J,J}||L_{J,J}|^T.$$

Substituting these bounds into (9.35), further substituting (9.34) and bounding again yields

$$\begin{aligned} |\Delta^{(1)}_{J,J}| \le{}& \gamma_{2(J-2)b}(|L_{J,1:J-1}||W_{1:J-1,J}| + (|L_{J,1:J-1}||W_{1:J-1,J}|)^T) \\ &+ (\gamma_{2(J-2)b} + \gamma_{3b-1} + \gamma_{2(J-2)b}\gamma_{3b-1})|L_{J,J}||T_{J,J}||L_{J,J}|^T \\ \le{}& \gamma_{2(J-2)b}(|L_{J,1:J-1}||W_{1:J-1,J}| + (|L_{J,1:J-1}||W_{1:J-1,J}|)^T) \\ &+ \gamma_{2(J-2)b+3b-1}|L_{J,J}||T_{J,J}||L_{J,J}|^T, \end{aligned} \tag{9.36}$$

which is the bound we require for $\Delta^{(1)}$.

The superdiagonal blocks of $\Delta^{(2)}$ correspond to Equation (9.1). That equation states
that blocks of W are computed by forming the corresponding blocks of the product $RL^T$,
which are formed according to the formula

$$W_{I,J} = 0.5\,(T_{I,I}(L_{J,I})^T) + T_{I,I+1}(L_{J,I+1})^T,$$

as we explain in Section 9.1.4. Because of the scaling by 0.5 we cannot apply Lemma 2.13
directly to this formula. Instead we must bound the errors resulting from forming the two
single-block products separately, and then use assumptions (2.4) and (2.3) to account for the
effects of scaling and summation respectively. We skip the details; the resulting bound is

$$|\Delta^{(2)}_{I,J}| \le \gamma_{b+1}(|R||L^T|)_{I,J} \tag{9.37}$$

for all $I < J$.

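Next we return to bounding (9.32), starting with the last two terms. The diagonal and
subdiagonal blocks of W are defined such that the corresponding blocks of $\Delta^{(2)}$ are zero, and
therefore

$$(|L||\Delta^{(2)}| + (|L||\Delta^{(2)}|)^T)_{J,J} = |L_{J,1:J-1}||\Delta^{(2)}_{1:J-1,J}| + (|L_{J,1:J-1}||\Delta^{(2)}_{1:J-1,J}|)^T.$$

Substituting (9.37) yields

$$\begin{aligned} (|L||\Delta^{(2)}| + (|L||\Delta^{(2)}|)^T)_{J,J} \le \gamma_{b+1}\big(&|L_{J,1:J-1}||R_{1:J-1,1:J}||L_{J,1:J}|^T \\ &+ (|L_{J,1:J-1}||R_{1:J-1,1:J}||L_{J,1:J}|^T)^T\big), \end{aligned}$$

which can be further simplified according to

$$\begin{aligned} &|L_{J,1:J-1}||R_{1:J-1,1:J}||L_{J,1:J}|^T + (|L_{J,1:J-1}||R_{1:J-1,1:J}||L_{J,1:J}|^T)^T \\ &\quad= 2\sum_{I=1}^{J-1}|L_{J,I}||R_{I,I}||L_{J,I}|^T + \sum_{I=1}^{J-1}|L_{J,I}||R_{I,I+1}||L_{J,I+1}|^T + \sum_{I=1}^{J-1}|L_{J,I+1}||R_{I,I+1}|^T|L_{J,I}|^T \\ &\quad= \sum_{I=1}^{J-1}|L_{J,I}||T_{I,I}||L_{J,I}|^T + \sum_{I=1}^{J-1}|L_{J,I}||T_{I,I+1}||L_{J,I+1}|^T + \sum_{I=1}^{J-1}|L_{J,I+1}||T_{I+1,I}||L_{J,I}|^T \\ &\quad= (|L||T||L|^T)_{J,J} - |L_{J,J}||T_{J,J}||L_{J,J}|^T, \end{aligned}$$

yielding

$$(|L||\Delta^{(2)}| + (|L||\Delta^{(2)}|)^T)_{J,J} \le \gamma_{b+1}((|L||T||L^T|)_{J,J} - |L_{J,J}||T_{J,J}||L_{J,J}|^T). \tag{9.38}$$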

To bound the first term in (9.36) we substitute $W = RL^T + \Delta^{(2)}$ and apply the same
arguments we used to produce (9.38), obtaining

$$|L_{J,1:J-1}||W_{1:J-1,J}| + (|L_{J,1:J-1}||W_{1:J-1,J}|)^T \le (1 + \gamma_{b+1})((|L||T||L^T|)_{J,J} - |L_{J,J}||T_{J,J}||L_{J,J}|^T). \tag{9.39}$$

Finally, substituting (9.39) into (9.36) and then substituting the result together with (9.38)
into (9.32) yields

$$\begin{aligned} |\Delta_{J,J}| \le{}& (\gamma_{2(J-2)b} + \gamma_{b+1} + \gamma_{2(J-2)b}\gamma_{b+1})((|L||T||L^T|)_{J,J} - |L_{J,J}||T_{J,J}||L_{J,J}|^T) \\ &+ \gamma_{2(J-2)b+3b-1}|L_{J,J}||T_{J,J}||L_{J,J}|^T. \end{aligned}$$

Because $b \ge 1$, we can bound

$$\gamma_{2(J-2)b} + \gamma_{b+1} + \gamma_{2(J-2)b}\gamma_{b+1} \le \gamma_{2(J-2)b+b+1} \le \gamma_{2(J-2)b+3b-1},$$

which allows us to cancel the two instances of $|L_{J,J}||T_{J,J}||L_{J,J}|^T$ and obtain

$$|\Delta_{J,J}| \le \gamma_{2(J-2)b+3b-1}(|L||T||L^T|)_{J,J}.$$

The constant $\gamma_{2(J-2)b+3b-1}$ is maximized when $J = N$, which yields the required bound. □

With these lemmas, we can now state the backward stability of the overall algorithm.
Theorem 9.4. The computed factors satisfy $A = LTL^T + \Delta$, where

$$|\Delta| \le \gamma_{2n-b-1}|L||T||L^T|$$

if $n > 3b$ and

$$|\Delta| \le \gamma_{n+2b}|L||T||L^T|$$

otherwise.

Proof. Lemmas 9.2 and 9.3 state that

$$|\Delta_{I,J}| \le \gamma_{n+2b}(|L||T||L^T|)_{I,J}$$

whenever $I \ne J$, and

$$|\Delta_{I,J}| \le \gamma_{2n-b-1}(|L||T||L^T|)_{I,J}$$

whenever $I = J$, and therefore the bound

$$|\Delta_{I,J}| \le \max\{\gamma_{n+2b}, \gamma_{2n-b-1}\}(|L||T||L^T|)_{I,J}$$

holds for all I and J. The quantity $\gamma_n$ increases monotonically with n (so long as $nu < 1$) and
therefore $\gamma_{2n-b-1} \ge \gamma_{n+2b}$ whenever $2n - b - 1 \ge n + 2b$, which occurs whenever $n > 3b$. □
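The case distinction in Theorem 9.4 is easy to tabulate. The sketch below assumes the standard definition $\gamma_k = ku/(1-ku)$ and the unit roundoff of IEEE double precision; the function names are illustrative.

```python
def gamma(k, u=2.0**-53):
    """gamma_k = k u / (1 - k u), assumed definition; valid only while k u < 1."""
    assert k * u < 1
    return k * u / (1 - k * u)

def block_aasen_constant(n, b):
    """Larger of the off-diagonal (Lemma 9.2) and diagonal (Lemma 9.3) constants."""
    return max(gamma(n + 2 * b), gamma(2 * n - b - 1))

for n, b in ((1000, 100), (300, 100), (200, 100)):
    diagonal_dominates = gamma(2 * n - b - 1) >= gamma(n + 2 * b)
    print(n, b, block_aasen_constant(n, b), diagonal_dominates)
```

For n = 1000 and b = 100 the diagonal constant $\gamma_{2n-b-1}$ is the larger one, while for n = 300 and b = 100 (so n = 3b) the off-diagonal constant $\gamma_{n+2b}$ takes over, matching the condition $n > 3b$.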
9.2.3 Growth

The stability of the factorization algorithm depends on the magnitude of L and T relative to
that of A. How large can L and T get? In the element-wise Aasen algorithm, the magnitude
of elements of L is bounded by 1 (because the algorithm uses partial pivoting), and it is easy
to show that $|T_{ij}| \le 4^{n-2}\max_{i,j}|A_{ij}|$ [86]; the argument is essentially the same as the one that
establishes the bound on $|U_{ij}|$ in Gaussian elimination with partial pivoting. Furthermore,
this bound is attained by a known matrix of order n = 3, although larger matrices that
attain the bound are not known [85, p. 224].

It is important to interpret this bound correctly. The actual expression $4^{n-2}$ is not
important, because it does not indicate the growth that is normally attained. The same
is true for LU with partial pivoting; it is stable in spite of the fact that the growth factor
bound is as large as $2^{n-1}$, not because this bound is small (it is not; it is huge). Two
other things are important. One is that the bound shows that growth is not related to the
condition number of A. The second is that growth in practice is small. The reasons for this
are complex and not completely understood even in LU with partial pivoting, but this is the
reality; for deeper analyses and discussion, see [128, 144] and [85, Section 9.4].
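The gap between the worst-case bound and typical behavior for LU with partial pivoting can be seen on a small example. The sketch below measures max|U| / max|A| for the classical worst-case matrix (ones on the diagonal and in the last column, -1 below the diagonal) and for a random matrix. The helper names are illustrative, and the growth measure used here looks only at the final factor U, which is a simplifying assumption.

```python
import numpy as np

def gepp_growth(A):
    """max|U| / max|A| after unblocked Gaussian elimination with partial pivoting."""
    U = A.astype(float).copy()
    n = U.shape[0]
    for k in range(n - 1):
        p = k + np.argmax(np.abs(U[k:, k]))       # first row holding the largest pivot candidate
        U[[k, p]] = U[[p, k]]                     # row swap (no-op when p == k)
        m = U[k + 1:, k] / U[k, k]                # multipliers, all of magnitude <= 1
        U[k + 1:, k:] -= np.outer(m, U[k, k:])
    return np.abs(np.triu(U)).max() / np.abs(A).max()

def worst_case_matrix(n):
    """1 on the diagonal and in the last column, -1 strictly below the diagonal."""
    A = np.eye(n) - np.tril(np.ones((n, n)), -1)
    A[:, -1] = 1.0
    return A

n = 10
rng = np.random.default_rng(0)
print(gepp_growth(worst_case_matrix(n)), 2 ** (n - 1))
print(gepp_growth(rng.standard_normal((n, n))))
```

On the structured matrix the last column doubles at every elimination step, so the measured growth equals the bound $2^{n-1}$; the random matrix gives a far smaller value, which is the typical situation the text refers to.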

If we compute the factorization in Equation (9.4) using LU with partial pivoting (GEPP),
essentially the same bounds hold for our block algorithm. The block columns of the L factor
are generated by Gaussian elimination with partial pivoting, so the same two-sided doubling-up
argument shows that the growth factor for T is bounded by $4^{n-b-1}$ (since the first columns
of L are unit vectors and additions/subtractions start only in column b + 1). We provide
numerical experiments in Section 9.4 to illustrate the backward stability and growth using
GEPP within the block-Aasen algorithm in practice.

When the factorization in Equation (9.4) is computed in a communication-avoiding way
using the tall-skinny LU factorization [80] (TSLU), L is still bounded, but the bound is $2^{bh}$,
where h is a parameter of TSLU that normally satisfies $h = O(\log n)$. This can obviously
be much larger than 1, although experiments indicate that L is usually much smaller. This
implies that growth in T is still bounded, but the bound is now $4^{nbh}$. This is worse than with
GEPP, but as we explained above, this theoretical bound is not what normally governs the
stability of the algorithm. We leave numerical experiments with TSLU in the block-Aasen
algorithm to future work. Also, the recently-developed panel rank-revealing LU factorization
[100] may improve the growth bounds in our algorithm, but we have not fully explored
this.


9.3  Sequential  Complexity  Analyses 

In  this  section  we  analyze  the  costs  of  the  sequential  algorithm.  We  begin  with  an  analysis 
of  the  computational  complexity  and  then  analyze  the  communication  costs  of  the  algorithm 
in the sequential model.

9.3.1 Computational Cost

Our goal in this subsection is to show that the new blocked algorithm performs the same
number of arithmetic operations as the element-wise one, up to lower order terms.

In order to determine the arithmetic complexity of the algorithm, we consider only
Equations (9.1)-(9.5) (the computational cost of pivoting is negligible). Letting J denote the
index of the outermost loop of the algorithm and b be the block size, the arithmetic cost
of Equations (9.1)-(9.3) is $O(Jb^3)$ flops. This is because each equation involves $O(J)$ block
multiplications of $b \times b$ blocks (some of which are triangular). Note that in Equation (9.2),
the dominant cost is in computing the product of the block row of L with the block column of
W; the arithmetic cost of the two-sided symmetric solve is $O(b^3)$. Similarly, Equation (9.5)
is a triangular solve involving one block and has an arithmetic cost of $O(b^3)$. The dominant
arithmetic cost for the block-Aasen algorithm comes from Equation (9.4), which consists of
two subcomputations: a matrix multiplication involving L and H and an LU decomposition
of a block column. The arithmetic cost of the LU decomposition is $O(Jb^3)$. The matrix
multiplication step multiplies an $(N-J)b \times Jb$ submatrix of L by a $Jb \times b$ submatrix of H.
At the Jth step of the algorithm, this arithmetic cost is $2(N-J)Jb^3$ flops, ignoring lower
order terms. Summing over the outermost loop and using the fact that $N = n/b$, we have a
total arithmetic cost of

$$\sum_{J=1}^{N}\left(2(N-J)Jb^3 + O(Jb^3)\right) = \frac{1}{3}n^3 + o(n^3).$$
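The leading term of this sum can be checked by direct evaluation. The sketch below sums only the matrix-multiplication term $2(N-J)Jb^3$ and compares it with $n^3/3$. The function name `matmul_flops` and the summation range J = 1, ..., N are assumptions made for the illustration, and b is assumed to divide n.

```python
def matmul_flops(n, b):
    """Sum over J = 1, ..., N of 2 (N - J) J b^3, the dominant term of Equation (9.4)."""
    N = n // b                                   # number of block columns
    return sum(2 * (N - J) * J * b**3 for J in range(1, N + 1))

n = 4096
for b in (16, 64, 256):
    # The ratio to n^3 / 3 tends to 1 as N = n / b grows.
    print(b, matmul_flops(n, b) / (n**3 / 3))
```

Evaluating the sum in closed form gives $(n^3 - nb^2)/3$, so the deviation from $n^3/3$ is of lower order whenever b is small relative to n.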


9.3.2  Communication  Costs 

To determine the communication complexity of the algorithm, we must consider
Equations (9.1)-(9.5) as well as the cost of applying symmetric permutations to the trailing
matrix. We analyze the three parts of the algorithm separately: block operations (all of
the computations described in Equations (9.1)-(9.5) with the exception of the LU
decomposition), LU decomposition of block columns, and application of the permutations. We
assume the matrix is stored in block-contiguous format with block size $b$, the same as the
algorithmic block size. In block-contiguous format, $b \times b$ blocks are stored contiguously in
memory (see Section 2.3.1). We assume column-major ordering of elements within blocks
and of the blocks themselves. In the following analysis, we assume $b < \sqrt{M/3}$ so that three
blocks fit simultaneously in fast memory.
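To make the block-size assumption concrete, the following sketch computes the largest integer block size for which three $b \times b$ blocks occupy at most $M$ words. The function name is hypothetical, and treating $M$ as an integer word count is an assumption of the sketch.

```python
import math


def max_block_size(fast_memory_words):
    """Largest integer b such that 3 * b * b <= fast_memory_words."""
    return math.isqrt(fast_memory_words // 3)


M = 2**20  # an assumed fast memory size, in words
b = max_block_size(M)
print(b, 3 * b * b <= M)
```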

9.3.2.1 Block Operations

By excluding the LU decomposition, all the other computations in Equations (9.1)-(9.5)
involve block operations: either block multiplication (sometimes involving triangular or
symmetric matrices), block triangular solve, or block two-sided symmetric triangular solve. For
example, in Equation (9.3), we compute

$$H_{I,J} = T_{I,I-1} (L_{J,I-1})^T + T_{I,I} (L_{J,I})^T + (T_{I+1,I})^T (L_{J,I+1})^T$$

(assuming only the lower halves of $T$ and $L$ are stored). Each of the three multiplications
involves $b \times b$ blocks, so by the assumption that $b < \sqrt{M/3}$, the operations can be
performed by reading contiguous input blocks of size $b^2$ words into fast memory, performing
$O(b^3)$ floating point operations, and then writing the output block back to slow memory.
This implies that the number of messages is proportional to the number of block
operations, which is $O((\text{computational cost})/b^3) = O(n^3/b^3)$, and the number of words moved is
$O((\text{computational cost})/b) = O(n^3/b)$.

Algorithm     Words                                    Messages
RLU           $O(nb^2/\sqrt{M} + nb \log b)$           $O(\min(n, n^2/M))$
SMLU [32]     see [32]                                 see [32]
TSLU [80]     see [80]                                 see [80]

Table 9.1: Communication costs of LU decomposition algorithms applied to an $n \times b$ matrix
stored in $b \times b$ block-contiguous storage, assuming $b < \sqrt{M/3}$.
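The counting argument for block operations can be sketched as a small cost model. The function name and the constants are assumptions made for illustration: each block multiply-accumulate is taken to cost $2b^3$ flops, read three contiguous blocks, and write one, so that it moves four messages of $b^2$ words each.

```python
def block_operation_traffic(total_flops, b):
    """Words and messages moved if all flops are done as b-by-b block multiplies.

    Assumes 2*b**3 flops and four messages of b*b words per block operation.
    """
    block_ops = total_flops / (2 * b**3)
    messages = 4 * block_ops
    words = messages * b * b
    return words, messages


n, b = 4096, 32
words, messages = block_operation_traffic(n**3 / 3, b)
print(words, messages)  # proportional to n^3 / b and n^3 / b^3, respectively
```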

9.3.2.2 Panel Decomposition

We now consider algorithms for the LU decomposition of the column panel. Note that
the $O(N)$ LU factorizations, each involving $O(Nb^3)$ flops, contribute altogether only an
$O(n^2 b)$ term to the computational complexity of the overall block-Aasen algorithm, a lower
order term. Thus, attaining the communication lower bound for the overall algorithm does
not require attaining optimal data re-use within panel factorizations. For example, using
a naive algorithm and achieving only constant re-use of data during the LU factorization
translates to a total of $O(n^2 b)$ words moved during LU factorizations, which is dominated
by the communication complexity of the block operations, $O(n^3/b)$ words, in the case where
$n > b^2$. However, to ensure that both bandwidth and latency costs of the LU factorizations
do not asymptotically exceed the costs of the rest of the block-Aasen algorithm for all
matrix dimensions, we need algorithms that achieve better than constant re-use (though the
algorithms need not be asymptotically optimal).

We choose to use the recursive algorithm (RLU) of [83, 142] for panel factorizations,
updated slightly to match the block-contiguous data layout. The algorithm works by splitting
the matrix into left and right halves, factoring the left half recursively, updating the right
half, and then factoring the trailing matrix in the right half recursively. In order to match
the block-contiguous layout, the update of the right half (consisting of a triangular solve
and matrix multiplication) should be performed block by block. The bandwidth cost of this
algorithm for $n \times b$ matrices is analyzed in [142], and the latency cost can be bounded by
the recurrence $L(n, b) \le 2L(n, b/2) + O(N)$. The $O(N)$ term comes from the update of the
right half of the matrix, which involves reading contiguous chunks of each of the $O(N)$ blocks
in the panel. The base case occurs either when the subpanel fits in memory ($nb < M$) or
when $b = 1$. The cost of the recursive algorithm is dominated by its leaves, each of which
requires $O(N)$ messages. Depending on the relative sizes of $n$ and $M$, there are either $nb/M$
or $b$ leaves starting with an $n \times b$ matrix. The latency cost becomes the minimum of two
terms: $O(n)$ or $O(n^2/M)$. The bandwidth and latency costs are summarized in the first row
of Table 9.1.

Algorithm     Words        Messages
Direct        $O(nb)$      $O(nb)$
Blocked       $O(n^2)$     $O(n^2/b^2)$

Table 9.2: Communication costs of symmetric pivoting schemes
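The latency recurrence for RLU can be evaluated directly to check that it behaves like $\min(n, n^2/M)$. This is a sketch under stated assumptions: the function name is hypothetical, the $O(N)$ term is taken to be exactly $N$, $n$ and $b$ are powers of two with $b$ dividing $n$, and the base case is taken to include equality ($nw \le M$ for subpanel width $w$).

```python
def rlu_messages(n, b, M):
    """Evaluate L(n, w) = 2 L(n, w/2) + N, starting from w = b, with N = n/b."""
    N = n // b  # number of b-by-b blocks in the panel

    def cost(width):
        # Leaf: the subpanel fits in fast memory, or it is a single column.
        if n * width <= M or width == 1:
            return N
        return 2 * cost(width // 2) + N

    return cost(b)


for n, b, M in [(2**14, 2**5, 2**12), (2**14, 2**5, 2**16)]:
    print(n, b, M, rlu_messages(n, b, M), min(n, n * n / M))
```

For panels that do not fit in fast memory, the evaluated cost stays within a small constant factor of $\min(n, n^2/M)$, in agreement with the first row of Table 9.1.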

In order to determine the contribution of LU factorizations to the costs of the overall
block-Aasen algorithm, we must multiply the cost of the $n \times b$ factorization by $N$, the
number of panel factorizations. Using the RLU algorithm, this yields a bandwidth cost of
$O(n^2 b/\sqrt{M} + n^2 \log b)$ words and a latency cost of $O(\min(n^2/b, n^3/(bM)))$ messages. With the
exception of the $O(n^2 \log b)$ term in the bandwidth cost, these costs are always asymptotically
dominated by the costs of the block operations.

While the RLU algorithm is sufficient for minimizing communication in the block-Aasen
algorithm, there are algorithms which require fewer messages. The Shape-Morphing
LU algorithm (SMLU) [32] is an adaptation of RLU that changes the matrix layout
on the fly to reduce latency cost. The algorithm and its analysis are provided in [32], and the
communication costs are given in the second row of Table 9.1. SMLU uses partial pivoting
and incurs a slight bandwidth cost overhead compared to RLU (an extra logarithmic factor on
one term). Another algorithm which reduces latency cost even further is the
communication-avoiding tall-skinny LU algorithm (TSLU) [80], as described in Section 7.1.4. The algorithm
can be applied to general matrices, but the main innovation focuses on tall-skinny matrices.
TSLU uses tournament pivoting, a different scheme than partial pivoting, which has slightly
weaker theoretical numerical stability properties. The algorithm and analysis are provided in
[80], and the communication costs are given in the third row of Table 9.1. The communication
costs of TSLU are optimal with respect to each panel factorization.

9.3.2.3 Applying Symmetric Permutations

After each LU decomposition of a block column, we apply the internal permutation to
the rest of the matrix. This permutation involves back-pivoting, or swapping rows of the
already factored $L$ matrix, and forward-pivoting of the trailing symmetric matrix. Applying
the symmetric permutations to the trailing matrix includes swapping elements within a given
set of rows and columns, as shown in Figure 9.12. For example, applying the transposition
$(k, l)$ implies that the L-shaped set of elements in the $k$th row and $k$th column (to the left
and below the diagonal) is swapped with the L-shaped set of elements in the $l$th row and
$l$th column, such that element $a_{kk}$ is swapped with element $a_{ll}$ and element $a_{lk}$ stays in place
(see Figure 9.12).

Figure 9.12: Exchanging rows and columns $k$ and $l$. The second (dark) block column is
the block column of the reduced matrix whose LU factorization was just computed. The
first block column is a block column of $L$; the algorithm applies back pivoting to it. Block
columns 3 to 6 are part of the trailing submatrix; the algorithm applies forward pivoting to
them. The trailing submatrix is square and symmetric, but only its lower triangle is stored,
so a row that needs to be exchanged is represented as a partial row (up to the diagonal) and
a partial column, as shown here.
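The effect of a single transposition on a symmetric matrix stored by its lower triangle can be sketched as follows. This is an element-wise illustration only, not the blocked scheme analyzed in this section; the function names are hypothetical, and the matrix is held as a plain list of lists in which only entries on or below the diagonal are referenced.

```python
def swap_symmetric_lower(a, k, l):
    """Apply the transposition (k, l) in place, touching only a[i][j] with i >= j."""
    if k == l:
        return
    if k > l:
        k, l = l, k
    n = len(a)
    a[k][k], a[l][l] = a[l][l], a[k][k]      # diagonal entries trade places
    for j in range(k):                       # row segments left of column k
        a[k][j], a[l][j] = a[l][j], a[k][j]
    for i in range(l + 1, n):                # column segments below row l
        a[i][k], a[i][l] = a[i][l], a[i][k]
    for m in range(k + 1, l):                # column k segment <-> row l segment
        a[m][k], a[l][m] = a[l][m], a[m][k]
    # a[l][k] stays in place


def permuted_reference(full, k, l):
    """Reference result P A P^T computed on the full symmetric matrix."""
    n = len(full)
    p = list(range(n))
    p[k], p[l] = p[l], p[k]
    return [[full[p[i]][p[j]] for j in range(n)] for i in range(n)]
```

Comparing the lower triangle produced by `swap_symmetric_lower` against `permuted_reference` on a small symmetric matrix confirms that the L-shaped sets are exchanged as described.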

Since there are at most $b$ swaps that must be performed for a given LU decomposition,
and each swap consists of $O(n)$ data, the direct approach of swapping L-shaped sets of
elements one at a time has a bandwidth cost of $O(nb)$ words. However, no matter how
individual elements within blocks are stored, because the permutations involve accessing
both rows and columns, at least half of the elements will be accessed non-contiguously, so
the latency cost of the direct approach is also $O(nb)$ messages. Since there are $N = n/b$
symmetric permutations to be applied, these costs amount to a total of $O(n^2)$ words and
$O(n^2)$ messages. While the bandwidth cost is a lower order term with respect to the rest of
the algorithm, the latency cost of the permutations exceeds that of the rest of the algorithm,
except when $n$ is at least on the order of $M^{3/2}$. This approach is the symmetric analogue of
Variant 1 in [80].

In  order  to  reduce  the  latency  cost,  we  use  a  blocked  approach  which  will  require  greater 
bandwidth  cost  than  the  direct  approach  but  will  not  increase  the  asymptotic  bandwidth 
cost  of  the  block-Aasen  algorithm.  The  blocked  approach  accesses  contiguous  b  x  b  blocks, 
but  it  may  permute  only  a  few  rows  or  columns  of  the  blocks.  This  approach  is  the  symmetric 
analogue  of  Variant  2  in  [80]. 
Figure 9.13: Exchanging a pair of rows in the blocked approach. Numbers indicate sets of
blocks that are held simultaneously in fast memory: all the blocks marked “1” are held in
fast memory simultaneously, later all the blocks marked “2”, and so on.


The  algorithm  works  as  follows:  for  each  block  in  the  LU  factorization  panel  that  includes 
a  permuted  row,  we  update  the  N  pairs  of  blocks  shown  in  Figure  9.13.  The  updates 
include  back-pivoting  (updating  parts  of  the  L  matrix  that  have  already  been  computed) 
and  forward-pivoting  (updating  the  trailing  matrix).  Nearly  all  the  updates  involve  pairs  of 
blocks,  which  fit  in  fast  memory  simultaneously.  Pairs  of  blocks  involved  in  back-pivoting  are 
not  affected  by  column  permutations  and  swap  only  rows  (those  marked  “1”  in  Figure  9.13). 
Some  pairs  of  blocks  involved  in  forward-pivoting  are  not  affected  by  row  permutations  and 
swap  only  columns  (those  marked  “2”  in  Figure  9.13).  Because  only  half  of  the  matrix  is 
stored,  some  pairs  of  blocks  in  the  trailing  matrix  will  swap  columns  for  rows  (those  marked 
“3”  in  Figure  9.13).  The  more  complicated  updates  involve  blocks  that  are  affected  by  both 
row  and  column  permutations:  the  two  diagonal  blocks  and  the  corresponding  off-diagonal 
block,  marked  “4”  in  Figure  9.13.  In  order  to  apply  the  two-sided  permutation  to  these 
blocks,  all  three  blocks  are  read  into  fast  memory  and  updated  at  once.  Since  there  are 
O(N)  blocks  in  each  LU  factorization  panel,  and  each  block  with  a  permuted  row  requires 
accessing  O(N)  blocks  to  apply  the  symmetric  permutation,  for  a  given  LU  factorization, 
the number of words moved in applying the associated permutation is $O(N^2 b^2) = O(n^2)$,
and the total number of messages moved is $O(N^2) = O(n^2/b^2)$. The communication costs
of  the  two  approaches  are  summarized  in  Table  9.2. 
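The trade-off between the two pivoting schemes can be tabulated with a short sketch that multiplies the per-factorization costs by the $N = n/b$ permutations. The function name is hypothetical and all constants hidden in the $O(\cdot)$ notation are set to one.

```python
def pivoting_costs(n, b):
    """Total words and messages over all N = n/b symmetric permutations."""
    N = n / b
    direct = {"words": N * n * b, "messages": N * n * b}
    blocked = {"words": N * n**2, "messages": N * (n / b) ** 2}
    return direct, blocked


direct, blocked = pivoting_costs(4096, 32)
print(direct)   # n^2 words and n^2 messages
print(blocked)  # n^3 / b words and n^3 / b^3 messages
```

In this model the blocked approach sends fewer messages than the direct approach whenever $n < b^3$, at the price of moving more words.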

9.3.2.4 Communication Optimality of the Block-Aasen Algorithm

Combining the analysis of the three sections above, we obtain the communication costs of
the overall algorithm. Assuming block-contiguous layout, the communication costs of the
block operations are $O(n^3/b)$ words and $O(n^3/b^3)$ messages, the costs of the panel
factorizations using the RLU algorithm are $O(n^2 b/\sqrt{M} + n^2 \log b)$ words and
$O(\min\{n^2/b, n^3/(bM)\})$ messages, and the costs of the pivoting using the blocked approach are
$O(n^3/b)$ words and $O(n^3/b^3)$ messages. Choosing a block size of $b = \Theta(\sqrt{M})$, we obtain
communication costs of the block-Aasen algorithm of

$$W = O\left( \frac{n^3}{\sqrt{M}} + n^2 \log M \right)$$

and

$$S = O\left( \frac{n^3}{M^{3/2}} \right).$$

Given the communication lower bound for this factorization (Corollary 4.14), this
algorithm is communication optimal except in the tiny range $M \ll n^2 \ll M \log^2 M$. If TSLU is
used for the panel factorization, the algorithm is optimal for all $n$.
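The way the three contributions combine can be checked numerically with a simple model. The sketch below is an illustration under assumptions: the function name is hypothetical, $b$ is set to $\sqrt{M/3}$, all hidden constants are set to one, and logarithms are taken base 2.

```python
import math


def block_aasen_costs(n, M):
    """Sum the word and message counts of the three parts with b = sqrt(M/3)."""
    b = math.sqrt(M / 3)
    words = (
        n**3 / b                                               # block operations
        + n**2 * b / math.sqrt(M) + n**2 * math.log2(b)        # RLU panel factorizations
        + n**3 / b                                             # blocked pivoting
    )
    messages = (
        n**3 / b**3                                            # block operations
        + min(n**2 / b, n**3 / (b * M))                        # RLU panel factorizations
        + n**3 / b**3                                          # blocked pivoting
    )
    return words, messages


n, M = 2**14, 2**20
W, S = block_aasen_costs(n, M)
print(W / (n**3 / math.sqrt(M) + n**2 * math.log2(M)))  # bounded by a constant
print(S / (n**3 / M**1.5))                              # bounded by a constant
```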


9.4  Numerical  Experiments 

Next we describe a set of numerical experiments that provide further insight into the
numerical behavior of the algorithm. We used a block size $b = 16$ in all the experiments.

We  carried  out  two  sets  of  experiments:  one  set  involving  random  matrices  and  another 
involving  matrices  from  the  University  of  Florida  Sparse  Matrix  Collection  [55].  In  the  first 
set  of  experiments  we  generated  a  sequence  of  random  square  symmetric  matrices  of  order  n 
for  100  distinct  values  of  n,  linearly  spaced  in  the  interval  100  <  n  <  5,000.  The  elements  of 
these  matrices  are  distributed  normally  and  independently  (preserving  symmetry,  of  course) 
with  mean  0  and  standard  deviation  1.  In  all  of  our  experiments  we  used  GEPP  for  panel 
factorizations.  We  leave  experiments  using  the  tall-skinny  LU  (TSLU)  algorithm  to  future 
work.  For  each  matrix  we  measured  three  parameters:  the  growth  factor,  the  backward  error 
of  the  factorization,  and  the  backward  error  in  the  solution  of  a  linear  system  of  equations 
Ax  =  b,  where  b  is  the  sum  of  the  columns  of  A  (so  x  is  the  vector  of  all  ones). 

We define the growth factor as the number

$$\frac{\left\| \, |L| \, |T| \, |L|^T \right\|_\infty}{\|A\|_\infty},$$

a definition that is justified by Theorem 9.4. The factorization error is defined as

$$\max_{i,j} \frac{\left| PAP^T - LTL^T \right|_{ij}}{\left( |L| \, |T| \, |L|^T \right)_{ij}},$$

using the convention $0/0 = 0$. We compute the backward error of the floating point solution
$\hat{x}$ to the system $Ax = b$ according to

$$\frac{\|A\hat{x} - b\|_\infty}{\|A\|_\infty \|\hat{x}\|_\infty + \|b\|_\infty},$$

where we use an unsymmetric LU algorithm to solve the band system.

Figure 9.14: Backward errors in the solution of $Ax = b$ and the growth factors in the
factorization of random matrices. The matrices are ordered by growth.

                  $n$        $\kappa$                   nnz/$n$
minimum           64         $5.0 \times 10^{0}$        0.46
1st quartile      800        $8.4 \times 10^{3}$        6.81
median            2,000      $2.5 \times 10^{6}$        12.94
3rd quartile      4,581      $3.2 \times 10^{10}$       23.93
maximum           8,140      inf                        378.19

Table 9.3: Statistics of the University of Florida matrix set.
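The backward-error measure defined above is straightforward to evaluate. The sketch below uses plain Python lists and hypothetical helper names; it shows only how the measure is computed for a given approximate solution and does not reproduce the experiments.

```python
def norm_inf_vec(v):
    """Infinity norm of a vector."""
    return max(abs(x) for x in v)


def norm_inf_mat(A):
    """Infinity norm of a matrix (maximum absolute row sum)."""
    return max(sum(abs(x) for x in row) for row in A)


def backward_error(A, x_hat, b):
    """||A x_hat - b||_inf / (||A||_inf ||x_hat||_inf + ||b||_inf)."""
    residual = [sum(a * x for a, x in zip(row, x_hat)) - bi for row, bi in zip(A, b)]
    return norm_inf_vec(residual) / (
        norm_inf_mat(A) * norm_inf_vec(x_hat) + norm_inf_vec(b)
    )


A = [[2.0, 1.0], [1.0, 3.0]]
b = [sum(row) for row in A]              # sum of the columns of A, so x is all ones
print(backward_error(A, [1.0, 1.0], b))  # exact solution
print(backward_error(A, [1.0, 1.5], b))  # perturbed solution
```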

The factorizations of random matrices were completely backward stable, with backward
errors between $1.1u$ and $2.4u$, with a median of $1.9u$. The stability of solutions to linear
systems and the growth factors are shown in Figure 9.14. The backward errors are moderate,
varying between $1.7 \times 10^{-15}$ and $1.7 \times 10^{-14}$. The backward error is increasing with $n$ but
at  a  rate  that  is  clearly  slower  than  linear.  The  growth  factor  is  strongly  correlated  with 
the  error,  which  is  consistent  with  the  bound  in  Theorem  9.4.  The  error  does  not  seem  to 
depend  on  n  outside  of  the  implicit  dependence  through  the  growth  factor,  in  contrast  with 
the  bound  in  Theorem  9.4. 
Figure 9.15: Backward errors and growth factors on matrices from the University of Florida
Collection. The matrices are ordered by growth.


The  second  set  of  experiments  factored  180  matrices  from  the  University  of  Florida  Sparse 
Matrix  Collection.  We  chose  for  this  experiment  all  of  the  symmetric,  real,  non-binary 
matrices of order $64 \le n \le 8{,}192$, with the exception of matrices with bandwidth $b$ or less.
Matrices  with  low  bandwidth  were  omitted  because  they  are  their  own  T  factors  and  therefore 
do  not  require  factorization.  This  set  of  180  matrices  is  further  described  in  Table  9.3.  The 
experiment  was  conducted  according  to  the  same  scheme  as  the  experiment  involving  the 
random  matrices. 

The  algorithm  experienced  difficulties  on  14  of  the  matrices;  they  are  discussed  later  in 
this  section. 

On the 166 matrices on which the algorithm produced good results, we obtained stable
factorizations with backward errors of less than $11u$. The stability of the linear solver and
the growth are shown in Figure 9.15. The linear-solver backward errors are in the interval
$[1.8 \times 10^{-18}, 2.0 \times 10^{-13}]$ with a median of $2.1 \times 10^{-15}$.

On 14 matrices, the linear solver that we used to solve banded systems involving $T$ failed
to produce a solution. For solving such systems we use the LAPACK subroutine gbsv, which
is a banded implementation of GEPP. The source of the problem is that when $T$ is rank
deficient, gbsv produces a $U$ factor with zeros on the diagonal, and this factor cannot be used
to solve linear systems. In all 14 matrices the root cause was structural or numerical rank
deficiency of $A$ (Matlab reported condition numbers larger than $10^{20}$). Our factorization
algorithm produced stable factorizations, with backward errors of order $u$, well conditioned
$L$’s, and mild growth (up to $1.2 \times 10^{6}$).

9.5  Conclusions 

We have shown that a block variant of Aasen’s factorization algorithm can reduce a
symmetric matrix into a symmetric banded form in a communication-avoiding way. No prior
symmetric  reduction  algorithm  achieves  similar  efficiency  bounds.  We  show  in  [15]  that  the 
shared-memory  parallel  algorithm  performs  well  in  practice  on  a  multi-core  machine;  here  we 
focused  on  complete  analyses  of  the  sequential  algorithm’s  communication  costs,  arithmetic 
costs,  and  numerical  stability. 
Chapter 10

Communication-Avoiding Successive Band Reduction


In  this  chapter,  we  present  new  sequential  and  parallel  algorithms  for  tridiagonalizing  a 
symmetric  band  matrix  in  order  to  compute  its  eigendecomposition.  In  order  to  preserve 
band  structure,  band  reduction  algorithms  based  on  orthogonal  similarity  transformations 
proceed  by  an  annihilate-and-chase  approach.  Annihilating  entries  within  the  band  creates 
fill-in (bulges); to preserve sparsity, these bulges are chased off the band before
annihilating subsequent entries. A general framework for this procedure, known as successive band
reduction  (SBR),  appears  in  [38]. 

The  main  contributions  of  this  chapter  are  the  following: 

•  we  describe  new  techniques  for  avoiding  communication  in  the  context  of  SBR, 

•  we  present  both  new  sequential  algorithms  and  improvements  on  existing  ones  that 
asymptotically  reduce  both  bandwidth  and  latency  costs, 

•  we  introduce  a  new  parallel  algorithm  that  requires  asymptotically  fewer  messages  than 
previous  approaches,  and 

•  we  describe  how  the  new  sequential  and  parallel  band  reduction  algorithms  can  be  used 
in  the  context  of  the  dense  problem  to  attain  the  corresponding  communication  lower 
bounds. 

Although no communication lower bound is known for the band reduction problem in
isolation, we demonstrate that previous approaches communicate asymptotically more than
necessary.  Our  results  also  prove  that  the  assumption  of  forward  progress  (Definition  4.18) 
is  a  necessary  condition  for  the  lower  bound  of  Theorem  4.25.  That  is,  our  algorithms  break 
the  assumption  and  beat  the  lower  bound,  attaining  asymptotically  better  data  re-use  than 
the  lower  bound  allows  (see  Section  10.1.4  for  more  details). 

While  the  symmetric  band  eigenproblem  is  interesting  in  its  own  right,  this  work  is 
motivated  by  the  high  communication  costs  of  the  standard  algorithms  for  solving  the  full 
(dense) symmetric problem via tridiagonalization. Greater efficiency than the standard
approach  can  be  obtained  if  the  tridiagonalization  procedure  is  split  into  two  steps:  reducing 
the  full  matrix  to  band  form  and  then  reducing  the  band  matrix  to  tridiagonal  form.  Thus, 
by  reducing  the  communication  and  improving  the  algorithm  for  tridiagonalizing  a  band 
matrix,  we  can  also  improve  algorithms  for  tridiagonalizing  a  full  matrix.  In  fact,  we  can 
attain  the  communication  lower  bound  that  applies  to  the  dense  problem  by  using  the  two- 
step  approach  and  the  algorithms  described  in  this  chapter  for  the  band  reduction  step. 
While  we  focus  on  real  symmetric  matrices  in  this  chapter,  the  ideas  here  can  be  readily 
applied  to  tridiagonalization  of  complex  Hermitian  matrices  as  well  as  bidiagonalization  of 
general  matrices  (for  singular  value  problems). 

The  rest  of  the  chapter  is  organized  as  follows.  In  Section  10.2,  we  extend  the  band 
reduction  algorithm  design  space  with  new  techniques  for  avoiding  communication.  The 
main  novel  contribution  is  the  idea  of  chasing  multiple  bulges  in  the  context  of  SBR.  In 
Section  10.3,  we  give  an  asymptotic  complexity  analysis  of  previous  approaches,  and  show 
how our new techniques can be used to improve their communication costs. We also
introduce a new algorithm, CASBR, which communicates asymptotically less than all other
approaches. In Section 10.4, we extend CASBR to a distributed-memory parallel algorithm
which communicates asymptotically fewer messages than previous approaches.

All  of  the  results  in  this  chapter  appear  in  [30],  written  with  coauthors  James  Demmel 
and  Nicholas  Knight.  A  preliminary  version  of  the  work  appeared  in  [31].  The  multiple  bulge 
chasing  approach  and  sequential  CASBR  algorithm  (for  eigenvalues  only)  first  appeared  in 
that  paper.  We  also  showed  how  to  extend  the  sequential  algorithm  to  a  shared-memory 
parallel environment, and our implementations obtained 2-6× speedups over
state-of-the-art library implementations. This chapter extends those results in two ways: we discuss
distributed-memory  algorithms  and  consider  computing  both  eigenvalues  and  eigenvectors, 
though  we  do  not  give  implementation  details  or  performance  results  here. 

10.1  Preliminaries 

10.1.1  Eigendecomposition  of  Band  Matrices 

In this chapter, we are interested in computing the eigenvalues (and possibly the
eigenvectors) of a symmetric band matrix via tridiagonalization. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric
band matrix with bandwidth $b$ (i.e., having $2b + 1$ nonzero diagonals). Because we preserve
symmetry, it is sufficient to store and operate on only the lower $b + 1$ diagonals of $A$. We
reduce $A$ to a symmetric tridiagonal matrix $T$ via orthogonal similarity transformations which
comprise an orthogonal matrix $Q$ such that $Q^T A Q = T$. We refer to this process as the band
reduction phase. We assume the eigendecomposition of the tridiagonal matrix $T$ is computed
via an efficient algorithm such as Bisection/Inverse Iteration, MRRR, Divide-and-Conquer,
or QR Iteration (e.g., see [63]), and we ignore the computation and communication costs of
this phase.
If only eigenvalues are desired, the eigenvalues of $T$ are the eigenvalues of $A$, so no extra
computation is required and $Q$ need not be computed or stored. If eigenvectors are also
desired, then a back-transformation phase is needed to reconstruct the eigenvectors of $A$
from the eigenvectors of $T$. That is, if the eigendecomposition of $T$ is given by $T = V \Lambda V^T$,
then the eigendecomposition of $A$ is $A = (QV) \Lambda (QV)^T$, so to compute the eigenvectors of $A$,
we must compute $QV$. There is a range of possibilities for computing $QV$: if we form $Q$ and
$V$ explicitly, then this can be done with matrix multiplication; if we store $Q$ implicitly (e.g.,
as a set of Householder vectors), then it can be applied to $V$ after $V$ is formed explicitly;
if QR Iteration is used to compute the eigendecomposition of $T$, then $Q$ should be formed
explicitly so that $V$ can be applied implicitly to $Q$ from the right as it is computed; or $Q$
and $V$ can be left implicit, allowing us to multiply by them when needed.
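The back-transformation identity can be illustrated on a small example in which both $Q$ and $V$ are formed explicitly (the first of the options listed above). Everything in the sketch is an assumption chosen for illustration: the $3 \times 3$ tridiagonal matrix with a known eigendecomposition, the Householder reflector used as $Q$, and the helper names.

```python
import math


def matmul(X, Y):
    """Dense matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


# Tridiagonal T with known eigendecomposition T = V diag(lam) V^T.
T = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
s = math.sqrt(2.0)
lam = [2.0 - s, 2.0, 2.0 + s]
V = transpose([[0.5, -s / 2, 0.5], [s / 2, 0.0, -s / 2], [0.5, s / 2, 0.5]])

# An orthogonal Q (a Householder reflector) and the matrix A with Q^T A Q = T.
w = [1.0, 2.0, 2.0]
Q = [[(1.0 if i == j else 0.0) - 2.0 * w[i] * w[j] / 9.0 for j in range(3)]
     for i in range(3)]
A = matmul(matmul(Q, T), transpose(Q))

# Back-transformation: the eigenvectors of A are the columns of X = Q V.
X = matmul(Q, V)
AX = matmul(A, X)
residual = max(abs(AX[i][j] - lam[j] * X[i][j]) for i in range(3) for j in range(3))
print(residual)  # at the level of rounding error
```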

In many applications only a subset of eigenpairs are desired. The cost of the
back-transformation can be made proportional to the number of eigenpairs desired; this can
significantly improve the runtime. Here, we consider only the case of computing all $n$
eigenpairs.

One of the most important applications of solving the symmetric band eigenproblem
arises when solving the full symmetric eigenproblem. An efficient alternative to direct
tridiagonalization [150] is a two-step approach [38]: (1) reducing the full matrix to a band matrix, and
(2) reducing the band matrix to tridiagonal form. Both direct and two-step
tridiagonalization approaches use orthogonal similarity transformations. The remainder of this chapter
concerns step (2); we discuss step (1) briefly in Section 10.5.

10.1.2  SBR  Notation 

We  follow  notation  from  [38]  and  the  authors’  related  papers  to  describe  the  terminology 
associated  with  successive  band  reduction  (SBR),  our  approach  for  reducing  A  to  T.  While 
we  do  not  give  a  complete  description  of  SBR  here,  Figure  2  in  [38]  is  particularly  helpful 
for  visualizing  the  framework. 

To  exploit  symmetry,  we  store  and  operate  on  only  the  lower  triangle  of  the  band  matrix, 
though  analogous  algorithms  apply  to  the  upper  triangle.  When  we  refer  to  a  column  of  the 
band,  we  mean  the  entries  of  the  column  on  and  below  the  diagonal. 

In a given sweep, SBR eliminates $d$ subdiagonals in sets of $c$ columns,¹ using an
annihilate-and-chase approach. We assume Householder transformations are used; each set of
transformations eliminates a $d$-by-$c$ parallelogram of nonzeros but creates trapezoidal-shaped fill
(a bulge). Using analogous orthogonal similarities, SBR chases each bulge off the end of the
band, translating the bulge $b$ columns to the right with each bulge chase. Figure 10.1 shows
the data access pattern of a single bulge chase. A QR decomposition of the $(d + 1 + c)$-by-$c$
matrix (QR region in Figure 10.1) containing the parallelogram computes the orthogonal
matrix that annihilates the parallelogram; the corresponding rows (PRE region) are
updated with a premultiplication of the orthogonal matrix; the corresponding columns (POST
region) are updated with a postmultiplication by the transpose of the orthogonal matrix,
creating the next bulge; and the lower half of the corresponding symmetric submatrix on
the diagonal (SYM region) is updated from both the left and right.

¹ We depart from the LAPACK-style notation nb of [38].

Figure 10.1: Anatomy of the bulge chasing operation. Following the notation of [38], the
bulge chasing operation based on an orthogonal similarity transformation can be
decomposed into four parts. There are $d$ diagonals in each bulge and $c$ is the number of columns
annihilated during a bulge chase which leaves behind triangular fill.

We define the working bandwidth $b + d + 1$ to be the number of subdiagonals necessary
to store the $b + 1$ diagonals of the matrix as well as to store the $d$ diagonals that hold
temporary  fill-in  during  the  course  of  a  sweep.  As  observed  in  [115],  we  note  that  an  entire 
bulge  need  not  be  eliminated;  only  the  first  c  columns  of  the  bulge  must  be  annihilated  to 
prevent  subsequent  bulges  from  introducing  nonzeros  beyond  the  working  bandwidth.  This 
results  in  temporary  triangular  fill. 

We index sweeps with an integer $i$, where $i = 1$ is the first sweep, so $b_1 = b$ is the
initial bandwidth. We index the parallelograms which initiate each bulge chase by $j$ and the
sequence  of  following  bulges  by  the  ordered  pairs  (j,  k):  j  is  the  parallelogram  index  and  k 
is  the  bulge  index,  as  in  [41]. 

10.1.3  Related  Work 

In  this  section  we  discuss  the  previous  approaches  for  band  reduction  in  both  sequential  and 
parallel  cases.  For  the  most  competitive  algorithms,  we  provide  more  detailed  communication 
cost analyses in later sections. See Tables 10.1-10.4 for summaries and comparisons of
communication  costs. 

10.1.3.1  Sequential  Algorithms 

The  two  papers  of  Bischof,  Lang,  and  Sun  [38,  43]  provide  a  general  framework  of  sequential 
SBR  algorithms.  Their  approach  first  appeared  in  [40]  and  generalizes  most  of  the  related 
work  described  in  this  section. 

The annihilate-and-chase strategy began with Rutishauser and Schwarz in 1963.
Rutishauser [126] identified two extreme points in the SBR algorithm design space: (1) a Givens
rotation-based approach with $b$ sweeps and $c_i = d_i = 1$ for each $i$ and (2) a column-based
approach with one sweep where $c_1 = 1$ and $d_1 = b - 1$. Rutishauser’s first approach
considered only pentadiagonal matrices; Schwarz [132] generalized the algorithm to arbitrary
bandwidths. Later, Schwarz [133] proposed a different algorithm based on Givens rotations
which does not fit in the SBR framework. This algorithm eliminates entries by column rather
than by diagonal and does not generalize to parallelograms.

Murata  and  Horikoshi  [115]  improved  on  Rutishauser’s  column-based  algorithm  by  noting 
that  computation  can  be  saved  by  eliminating  only  the  first  column  of  the  triangular  bulge 
rather  than  the  entire  triangle.  If  eigenvectors  are  desired,  Bischof,  Lang,  and  Sun  [41] 
showed  that,  with  this  approach,  the  Householder  vectors  comprising  Q  can  be  stored  in  a 
lower  triangular  n-by-n  matrix  and  applied  to  V  in  a  different  order  than  they  were  computed, 
yielding  higher  performance  during  the  back-transformation  phase. 

Kaufman [98] vectorized the Rutishauser/Schwarz algorithm [126, 132], chasing multiple
single-element bulges in each vector operation. Her motivation for chasing multiple bulges
was not locality but rather to increase the length of the vector operation beyond the
bandwidth $b$. Several years later, Kaufman [97] took the approach of [133] in order to maximize
the vector operation length (especially in the case of large $b$) and make use of a BLAS
subroutine when appropriate. When eigenvectors are requested, the $Q$ matrix is formed
explicitly by applying the updates to an identity matrix. By exploiting sparsity, the flop cost
of constructing $Q$ is about $(4/3)n^3$, compared with $2n^3$ if sparsity is ignored. The current
LAPACK [8] reference code for band reduction (sbtrd) is based on [97].

More recently, Rajamanickam [123] proposed and implemented a different way of
eliminating a parallelogram and chasing its fill. His algorithm uses Givens rotations to eliminate
the individual entries of a parallelogram, and instead of creating a large bulge, the update
rotations are pipelined such that as soon as an element is filled in outside the band, it is
immediately annihilated. The rotations are carefully ordered to obtain temporal and
sequential locality. By avoiding the fill-in, this algorithm does up to 50% fewer flops than the
Householder-based elimination of parallelograms within SBR and requires minimal working
bandwidth.

10.1.3.2 Parallel Algorithms

Lang [103, 102] implemented a distributed-memory parallel version of the band reduction
algorithm in [115], although he did not consider computing $Q$. Bischof et al. [39] implemented
a distributed-memory parallel instance of the SBR framework in the context of
tridiagonalizing a full matrix. A subsequent paper [41] extended this implementation to reorganize and
block the orthogonal updates comprising $Q$.

Luszczek et al. [109] implemented the band reduction algorithm from [115] as part of a
two-step shared-memory tridiagonalization algorithm in the PLASMA library [6], using
dynamic DAG-scheduling of tile-based tasks. They distinguished between “right-looking” and
“left-looking” variants: right-looking algorithms chase a bulge entirely off the band before
eliminating the next parallelogram, whereas left-looking algorithms chase bulges only far enough to
allow for the next bulge to be created (see Constraint 2). For example, the SBR framework
[38] is right-looking while Kaufman’s algorithm [97] is left-looking. In [109], they found
improved performance with a left-looking variant. Later, Haidar et al. [84] reduced the runtime
of [109]; the improvements in the band-to-tridiagonal step include using an algorithm-specific
(static) scheduler, “grouping” related tasks, and avoiding fill-in using pipelined Givens
rotations (a single-sweep version of the approach in [123]).

Auckenthaler et al. [10, 11, 12] have implemented a two-step distributed-memory
tridiagonalization algorithm as part of a solver for the generalized symmetric eigenproblem. Their
band-to-tridiagonal step uses an improved version of Lang’s algorithm [102], which performs
one sweep. They give a new algorithm for orthogonal updates which uses a 2D processor
layout instead of a 1D layout. Their implementation also supports taking multiple sweeps
when eigenvectors are not requested; however, this algorithm is not given explicitly.

10.1.4  Related  Lower  Bounds 

No  communication  lower  bound  has  been  established  for  annihilate-and-chase  band  reduction 
algorithms,  so  we  cannot  conclude  that  our  new  algorithms  are  communication  optimal  in 
an  asymptotic  sense.  In  fact,  Theorem  4.25  in  Section  4.3,  which  applies  to  many  algorithms 
that  use  orthogonal  transformations,  does  not  apply  to  SBR  or  its  variants  because  they  fail 
to  satisfy  forward  progress  (Definition  4.18).  That  is,  the  lower  bound  proof  there  requires 
that  an  orthogonal  transformation  algorithm  not  fill  in  a  previously  created  zero — this  occurs 
frequently  in  SBR,  unlike  QR  decomposition. 

The main results of Chapter 4 state that an applicable algorithm that performs $G$ flops
must move $\Omega(G/\sqrt{M})$ words and send $\Omega(G/M^{3/2})$ messages for sufficiently large problems.
For most dense matrix algorithms, the number of flops is $G = O(n^3/P)$, where $P = 1$ for
the sequential case. In the parallel case, if we assume minimal local memory is used (i.e.,
$M = O(n^2/P)$, or just enough to store the input and output matrices), then the lower bounds
simplify to $\Omega(n^2/\sqrt{P})$ words and $\Omega(\sqrt{P})$ messages.
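
As a quick check of this simplification, the substitution can be written out explicitly. This is only a worked sketch: it takes $G = n^3/P$ and $M = n^2/P$ exactly, dropping the constants hidden in the asymptotic notation.

```latex
\[
\frac{G}{\sqrt{M}} = \frac{n^3/P}{n/\sqrt{P}} = \frac{n^2}{\sqrt{P}},
\qquad
\frac{G}{M^{3/2}} = \frac{n^3/P}{n^3/P^{3/2}} = \sqrt{P}.
\]
```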

Since our new sequential algorithm (see Algorithm 10.1 and Table 10.1) performs $O(n^2 b)$
flops, moves $O(n^2 b^2/M)$ words, and sends $O(n^2 b^2/M^2)$ messages, its bandwidth and latency
costs drop below the lower bounds by a factor of $O(\sqrt{M}/b)$ for $2 \le b \le \sqrt{M}/3$. For small $b$
and large $n$ (such that the band does not fit entirely in fast memory), this discrepancy is as
much as $O(\sqrt{M})$. Similarly, our new parallel algorithm (see Algorithm 10.3 and Table 10.3)
also beats the lower bounds for bandwidth and latency costs, and the discrepancy is largest
for small bandwidths. Thus, our algorithms show that not only does the lower bound proof
technique not apply to annihilate-and-chase algorithms, the bound itself must not apply.
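
The size of the gap in the sequential case can be recovered by dividing the Chapter 4 expressions, evaluated at $G = n^2 b$, by the costs of the sequential algorithm. The following is a worked sketch that ignores constant factors:

```latex
\[
\frac{n^2 b/\sqrt{M}}{n^2 b^2/M} = \frac{\sqrt{M}}{b},
\qquad
\frac{n^2 b/M^{3/2}}{n^2 b^2/M^2} = \frac{\sqrt{M}}{b}.
\]
```

Both ratios are largest when $b$ is a small constant, where they approach $\sqrt{M}$.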


10.2  Avoiding  Communication  in  Successive  Band 
Reduction 

The goal of our algorithms is to avoid communication by reorganizing computation,
extending the SBR framework to obtain greater data locality. In the sequential case, we can
asymptotically reduce the number of words and messages that must be moved between fast
and slow memory during the execution of the algorithm; in the parallel case, we can
asymptotically reduce the number of messages sent between processors. We achieve data locality
(i.e., avoid communication) using two techniques described in Sections 10.2.1 and 10.2.2.
We navigate the constraints and tradeoffs that arise using a successive halving approach,
described in Section 10.2.3.

10.2.1  Applying  Multiple  Householder  Transformations 

The  first  means  of  achieving  data  locality  is  within  a  single  bulge  chase  (see  Figure  10.1). 
Since  c  Householder  vectors  are  computed  to  eliminate  the  first  c  columns  of  the  bulge 
(QR  region),  every  entry  in  the  PRE,  SYM,  and  POST  regions  is  updated  by  c  left  and/or 
right  Householder  transformations.  These  transformations  may  be  applied  one  at  a  time  or 
blocked (e.g., via [131]). Assuming all the data involved in a single bulge chase reside in fast
or  local  memory,  O(c)  flops  are  performed  for  every  entry  read  from  slow  memory. 

We  identify  the  following  algorithmic  constraint.  If  it  is  violated,  then  the  parallelogram 
annihilated  by  the  left  update  will  be  (partially)  refilled  by  the  right  update  (i.e.,  the  SYM 
and  POST  regions  overlap  the  QR  region) — this  implies  wasted  computation. 

Constraint  1.  To  annihilate  a  parallelogram  within  the  SBR  framework,  the  dimensions  of 
the  parallelogram  must  satisfy 

$c + d \le b$.

While  increasing  c  improves  data  locality,  it  limits  the  size  of  d  due  to  Constraint  1. 
Because  d  is  the  number  of  diagonals  eliminated  in  a  sweep,  this  constraint  creates  a  tradeoff 
between  locality  and  progress  towards  tridiagonal  form. 

10.2.2 Chasing Multiple Bulges

The second means of achieving data locality is across bulge chases. If $\omega$ bulges can be chased
through the same set of columns without data movement, then we have achieved $O(\omega)$ reuse
of those columns. Recall that we refer to columns as the subset of column entries on and
below the diagonal. We first establish the following constraint.

Constraint  2.  No  bulge  may  be  chased  into  a  set  of  columns  still  occupied  by  a  previously 
created  bulge. 

If  this  constraint  is  violated,  then  the  fill  will  expand  beyond  the  working  bandwidth 
of  the  sweep.  While  it  is  possible  to  eliminate  this  extra  fill,  we  wish  to  avoid  the  extra 
computation  and  storage  necessary  to  do  so.  Chasing  the  first  c  columns  of  a  bulge  and 
leaving  behind  the  triangular  fill  is  the  least  amount  of  work  required  to  prevent  the  fill  from 
exceeding  the  working  bandwidth. 

We  state  the  following  lemmas  regarding  parallelograms,  bulges,  sets  of  bulges,  and  the 
working  set  (measured  in  columns)  for  chasing  a  set  of  bulges.  We  assume  in  both  cases  that 
Constraints  1  and  2  are  satisfied. 

Lemma 10.1. Given a sweep of SBR with parameters $b$, $c$, and $d$,

(a) the $j$th parallelogram occupies columns $1 + (j - 1)c$ through $jc$,

(b) bulge $(j, k)$ occupies columns $1 + (j - 1)c + kb - d$ through $jc + kb$,

(c) bulges $(j, k)$ and $(j + 1, k - 2)$ do not overlap.²

Lemma 10.2. Chasing the set of $\omega$ bulges

$\{(j, k), (j + 1, k - 2), \ldots, (j + \omega - 1, k - 2(\omega - 1))\}$

each $\ell$ times requires a working set of $(\omega - 1)(2b - c) + c + d + b\ell$ columns.

Proof. By Lemma 10.1(c), this set of bulges is nonoverlapping. If the bulges are chased in
turn $\ell$ times each, starting with the right-most bulge $(j, k)$ and ending with the left-most
bulge $(j + \omega - 1, k - 2(\omega - 1))$, then there is no violation of Constraint 2. The conclusion
follows from Lemma 10.1(b). □

Figure 10.2 demonstrates the working set of 44 columns with $\omega = 2$ bulges chased $\ell = 3$
times each on a matrix with bandwidth $b = 8$ with $c = d = 4$.
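
The column ranges in Lemma 10.1(b) and the working set size in Lemma 10.2 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it assumes that chasing bulge $(j, k)$ once produces bulge $(j, k + 1)$, which is how the bulge index is used in Lemma 10.1.

```python
def bulge_columns(j, k, b, c, d):
    """Inclusive 1-based column range of bulge (j, k), per Lemma 10.1(b)."""
    return (1 + (j - 1) * c + k * b - d, j * c + k * b)


def working_set_columns(omega, ell, b, c, d):
    """Closed-form working set size (in columns) from Lemma 10.2."""
    return (omega - 1) * (2 * b - c) + c + d + b * ell


def working_set_by_enumeration(j, k, omega, ell, b, c, d):
    """Count columns from the left edge of the left-most bulge in the set
    to the right edge of the right-most bulge after ell chases."""
    left, _ = bulge_columns(j + omega - 1, k - 2 * (omega - 1), b, c, d)
    _, right = bulge_columns(j, k + ell, b, c, d)
    return right - left + 1


if __name__ == "__main__":
    # Parameters of Figure 10.2: b = 8, c = d = 4, two bulges, three chases each.
    b, c, d, omega, ell = 8, 4, 4, 2, 3
    print(working_set_columns(omega, ell, b, c, d))
    print(working_set_by_enumeration(1, 2, omega, ell, b, c, d))
```

Both computations give 44 columns for the parameters of Figure 10.2, and the two bulges in the starting set occupy disjoint column ranges.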

Our motivation for defining a working set is to ensure that the operation of chasing $\omega$
bulges $\ell$ times can be done entirely in fast memory (in the sequential case) or local memory
(in the parallel case). We will specify the constraints in each case when we present our
algorithms below.


²Note that if $2c + d \le b$, bulges $(j, k)$ and $(j + 1, k - 1)$ also do not overlap.

[Figure 10.2 panels: (a) The $\omega = 2$ bulges occupy 20 of the 24 columns on the left. (b) The first (right-most) bulge is chased $\ell = 3$ times. (c) The second bulge is chased $\ell = 3$ times.]

Figure 10.2: Chasing a set of bulges. We store and operate on only the lower triangle of the
band matrix. The parameters shown are $b = 8$, $c = 4$, and $d = 4$; $\omega = 2$ bulges are chased
$\ell = 3$ times each. Only $2b\ell = 48$ columns of the band are shown. The working bandwidth
includes the diagonals which contain bulges and triangular fill. Note that the triangular fill
left behind by the first bulge does not cause any increase in the working bandwidth as the
second bulge is chased.


10.2.3  Successive  Halving 

We will navigate the tradeoff imposed by Constraint 1 by setting $c_i = d_i = b_i/2$ at each
sweep $i$, reducing to tridiagonal form after $\log b$ sweeps. We call this a successive halving
approach. We will pick the number of bulges in a set ($\omega_i$) and the number of times each bulge
is chased ($\ell_i$) such that on each sweep (as the bandwidth is successively halved) we double
the number of bulges that we chase in a set, and chase each bulge twice as many times,
compared to the previous sweep. While the successive halving approach (and doubling $\omega_i$
and $\ell_i$) simplifies our asymptotic analysis, in practice the parameters $\{c_i, d_i, \omega_i, \ell_i\}$ should
be tuned independently for best performance; we suggested a framework for automatically
tuning these parameters in a shared-memory implementation [31, Section 5].
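
The halving schedule is simple enough to enumerate directly. The following is a minimal sketch, assuming the initial bandwidth is a power of two; the function name is hypothetical and the code only lists the per-sweep parameters, it does not perform any reduction.

```python
def halving_schedule(b):
    """Per-sweep (b_i, c_i, d_i) with c_i = d_i = b_i / 2, for b a power of two."""
    if b < 2 or b & (b - 1):
        raise ValueError("b must be a power of two, at least 2")
    schedule = []
    b_i = b
    while b_i > 1:
        c_i = d_i = b_i // 2
        schedule.append((b_i, c_i, d_i))
        b_i -= d_i  # sweep i eliminates d_i diagonals
    return schedule


if __name__ == "__main__":
    for b_i, c_i, d_i in halving_schedule(8):
        print(b_i, c_i, d_i)
```

For $b = 8$ this lists three sweeps, each satisfying Constraint 1 with equality, and the eliminated diagonals sum to $b - 1 = 7$.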


10.3  Sequential  Band  Tridiagonalization  Algorithms 

Recall our sequential machine model, where communication is moving data between slow
memory of unbounded capacity and a fast memory with a capacity of $M$ words. We will
first consider the case of computing eigenvalues only and then extend to the case of
computing both eigenvalues and eigenvectors. We will not analyze the solution of the tridiagonal
eigenproblem. In each case, we discuss existing approaches, apply our techniques to improve
them, and then present our communication-avoiding approach.

For our sequential algorithms, we will assume the initial bandwidth $b$ is bounded above
by $\sqrt{M}/3$. As mentioned in Sections 10.1.1 and 10.5, this is a reasonable assumption if
the band reduction is used as the second step of a two-step reduction of a full symmetric
matrix to tridiagonal form. For larger bandwidths, another approach must be taken to avoid
communication (see Section 10.5). We also assume that $nb \gg M$ (the band does not fit in
fast memory).

10.3.1  Computing  Eigenvalues  Only 

When only eigenvalues are desired, the runtime is dominated by the band reduction.
Computing the eigenvalues of a tridiagonal matrix involves only $O(n)$ data and less computation
than the band reduction ($O(n^2)$ as opposed to $O(n^2 b)$). While there is a large design space
for band reduction, the computational cost ranges from $4n^2 b$ to $6n^2 b$, a difference of only
50% (as long as a bulge-chasing procedure is used to prevent unnecessary fill). However, the
communication cost (and expected performance) has a much larger range.

Under  the  assumption  above,  the  matrix  does  not  fit  in  fast  memory  (otherwise,  the 
communication costs are the same for all algorithms: $O(nb)$). In the case that $n < M$ (i.e.,
one  or  more  diagonals  fit  in  fast  memory),  when  the  bandwidth  is  reduced  such  that  the 
remaining  band  matrix  fits  in  fast  memory,  the  communication  cost  of  remaining  sweeps  is 
that  of  reading  the  band  into  fast  memory  once  and  writing  the  tridiagonal  output. 

Table 10.1 summarizes the computation and communication costs of various algorithms
for tridiagonalizing a band matrix (for computing eigenvalues only). Our new approach,
CASBR, improves the communication costs compared to the previous approaches. For
example, CASBR moves a factor of $M/b$ fewer words than LAPACK or MH, which is at least
$\sqrt{M}$ in the range of $b$ considered, and near $M$ for $b = O(1)$. Note that while the
computational costs vary only by constant factors, these factors can make a difference in practice.
The tradeoffs between different algorithms and between computation and communication
should be navigated with autotuning (of algorithms and parameters) in practice. In the
context of two-step tridiagonalization of a dense matrix, CASBR is the only approach that
always attains (or beats) the lower bounds discussed in Section 10.1.4.

10.3.1.1  Alternative  Algorithms 

We first consider Kaufman’s algorithm [97], which is implemented in the current LAPACK
reference code [8], given in the first row of Table 10.1. The algorithm uses Givens rotations
and performs $4n^2 b$ flops. It is left-looking and chases multiple single-element bulges in order
to maximize the vector operation length, but it does not limit the size of the working set
to fit in fast memory. As a result, the algorithm has to read (from slow memory) at least
one of each pair of entries to be updated by a Givens rotation. Thus, the data reuse is $O(1)$
and the total number of words transferred between fast and slow memory is proportional
to the number of flops: $O(n^2 b)$. Since fine-grained data access occurs along both rows and
columns, the latency cost is on the same order as the bandwidth cost, assuming LAPACK’s
column-major layout.

Algorithm                    Flops                                        Words                                      Messages
LAPACK [97]                  $4n^2 b$                                     $O(n^2 b)$                                 $O(n^2 b)$
MH [115]                     $6n^2 b$                                     $O(n^2 b)$                                 $O(n^2 b/M)$
Improved MH                  $6n^2 b$                                     $O(n^2 b^3/M)$                             $O(n^2 b^3/M^2)$
SBR [38]                     $\sum_{i=1}^{s} (4d_i + 2d_i^2/b_i)\,n^2$    $O(n^2 \sum_{i=1}^{t} (1 + d_i/c_i))$      $O((n^2/M) \sum_{i=1}^{t} (1 + d_i/c_i))$
SBR ($c_i = d_i = b_i/2$)    $5n^2 b$                                     $O(n^2 t)$                                 $O(n^2 t/M)$
CASBR                        $5n^2 b$                                     $O(n^2 b^2/M)$                             $O(n^2 b^2/M^2)$

Table 10.1: Asymptotic comparison of previous sequential algorithms for tridiagonalization
(for eigenvalues only) with our improvements, for symmetric band matrices of $n$ columns
and $b + 1$ subdiagonals on a machine with fast memory of size $M$. The table assumes that
$nb \gg M$ and that $2 \le b \le \sqrt{M}/3$. The analysis for all algorithms is given in Section 10.3.
In the fourth and fifth rows, $s$ is the number of sweeps performed and $t \le s$ is the smallest
sweep index such that the subsequent sweeps can be performed in fast memory, or $t = s$
otherwise.

Next, we consider the Householder-based approach of Murata and Horikoshi [115], given
as MH in the second row of Table 10.1. In this algorithm, each column is eliminated all at
once, and the bulge is chased completely off the band before the next column is eliminated.
Because of operations on the triangular fill, the number of flops required increases to $6n^2 b$
compared to Givens-based algorithms. Since each bulge is chased entirely off the band, the
entire band must be read from slow memory for every column eliminated, a total of $O(n^2 b)$
words moved. Assuming column-major layout, the sequence of bulge chases for each column
(i.e., bulges $(j, k)$ for fixed $j$) is executed on contiguous data, and the latency cost is a factor
of $O(M)$ less than the bandwidth cost.

In order to reduce communication costs for the MH algorithm, it is possible to apply one
of the optimizations described in Section 10.2: chasing multiple bulges. From Lemma 10.2,
we can chase $O(M/b^2)$ bulges at a time and maintain a working set which fits in fast memory.
This results in a reduction of both bandwidth and latency costs by a factor of $O(M/b^2)$. We
call this algorithm “Improved MH,” given in the third row of Table 10.1.

Consider an algorithm within the SBR framework with parameters $\{(b_i, c_i, d_i)\}_{i=1,2,\ldots,s}$,
which does not chase multiple bulges at a time (i.e., $\omega_i = 1$ for every $i$). This corresponds to
the fourth row of Table 10.1. The flop count is given by [38, Equation (3)] (and Lemma 10.7
below). Note that the approach of [123] allows the computational cost to be reduced to
$4n^2 b$ for all parameter choices. Since the SBR framework is right-looking, the trailing band
must be read for each parallelogram eliminated. During the $i$th sweep, there are $O(n/c_i)$
parallelograms and each parallelogram is chased $O(n/b_i)$ times. The amount of data accessed
during one bulge chase is $O(b_i(c_i + d_i))$ words: for example, $b_i$ columns are accessed during
the left update and each bulge occupies $c_i + d_i$ rows. Thus, the number of words read during
the $i$th sweep is $O(n^2(1 + d_i/c_i))$. In the best case, the latency cost is a factor of $M$ smaller
than the bandwidth cost.

If we apply the successive halving approach ($c_i = d_i = b_i/2$) but do not chase multiple
bulges, then the costs of SBR simplify to $O(n^2 t)$ words (where $t \le \log b$ is the smallest sweep
index such that $n(b_t + d_t + 1) \le M$, or $t = \log b$ otherwise) and $O(n^2 t/M)$ messages, in the
best case. These costs appear in the fifth row of Table 10.1.

10.3.1.2  CASBR 

The communication avoiding sequential algorithm, shown in Algorithm 10.1, is based on the
framework given in [38], using the successive halving approach (see Section 10.2.3). Our
main deviation from the original SBR framework is chasing multiple bulges at a time, as
described in Section 10.2.2. Recall that $\omega_i$ denotes the number of bulges chased at a time,
and $\ell_i$ the number of times each bulge is chased, during sweep $i$. We would like to maximize
$\omega_i$ so that for some $\ell_i \ge 1$, this working set fits in a fast memory of size $M$ words. We
ignore the sparsity below the $b_i$th subdiagonal by assuming each column has $b_i + d_i + 1$
nonzeros (i.e., the working bandwidth). It follows from Lemma 10.2 that we would like to
pick positive integers $\omega_i$ and $\ell_i$ such that $\omega_i$ is maximized and

$((\omega_i - 1)(2b_i - c_i) + c_i + d_i + b_i \ell_i)(b_i + d_i + 1) \le M$.    (10.1)

We use a successive halving approach, as mentioned above. That is, at each sweep $i$, we
cut the remaining bandwidth $b_i$ in half by setting $d_i = b_i/2$. We also set $c_i = b_i/2$ (which
satisfies Constraint 1). To simplify the analysis, we assume that the initial bandwidth $b = b_1$
is a power of two.

As in Lemma 10.2, when chasing a set of $\omega_i$ bulges, we work right-to-left, chasing each
bulge $\ell_i = (3/2)\omega_i$ times in turn. In this way, after all bulges in the set are chased, the set
does not overlap the previous columns occupied, and the relative positions of the bulges are
maintained. This process is shown in Figure 10.2 and corresponds to line 9 in Algorithm 10.1.
Fixing $\ell_i$ in terms of $\omega_i$ also has the benefit of decreasing the latency cost on successive sweeps.
While the constant ratio between $\omega_i$ and $\ell_i$ simplifies theoretical analysis, these parameters
can be tuned independently in practice.

With  these  parameter  choices  and  assumptions,  inequality  (10.1)  simplifies,  as  given  in 
the  following  constraint. 

Constraint 3. Assuming $b$ and $\omega$ are even, $c = d = b/2$, $\ell = (3/2)\omega$, and $b \le \sqrt{M}/3$, then
the number of bulges chased at a time must satisfy $\omega \le 4M/(9(b + 1)^2)$.

By  satisfying  Constraint  3,  we  ensure  that  the  entire  operation  can  be  performed  on 
columns  which  all  fit  in  fast  memory  simultaneously. 
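
For concrete values of $M$ and $b$, inequality (10.1) can also be checked by direct search. The sketch below is not the closed-form bound of Constraint 3; it is a brute-force illustration with hypothetical function names, assuming $c = d = b/2$ and $\ell = (3/2)\omega$ with $\omega$ even.

```python
def working_set_words(omega, ell, b, c, d):
    """Left-hand side of inequality (10.1): columns in the working set
    times the working bandwidth b + d + 1."""
    columns = (omega - 1) * (2 * b - c) + c + d + b * ell
    return columns * (b + d + 1)


def largest_even_omega(M, b):
    """Largest even omega (with c = d = b/2 and ell = 3*omega/2) whose working
    set fits in M words; returns 0 if even omega = 2 does not fit."""
    c = d = b // 2
    omega = 0
    while working_set_words(omega + 2, 3 * (omega + 2) // 2, b, c, d) <= M:
        omega += 2
    return omega


if __name__ == "__main__":
    # With b = 8, the working set of Figure 10.2 is 44 columns of 13 entries each.
    print(working_set_words(2, 3, 8, 4, 4))
    print(largest_even_omega(2000, 8))
```

In practice one would tune $\omega$ and $\ell$ independently, as noted above; the fixed ratio here only mirrors the simplification used in the analysis.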

We include explicit memory operations within the algorithm in order to determine the
communication  costs:  writes  imply  moving  data  from  fast  memory  to  slow  memory  and 
reads  imply  moving  data  from  slow  memory  to  fast  memory. 


Algorithm 10.1 Sequential CASBR

Require: initial bandwidth $b \le \sqrt{M}/3$ is a power of 2

 1: $t = \min\{\log b, \lceil \log(n(b + 1)/M) \rceil\}$
 2: for $i = 1$ to $t$ do
 3:     $b_i = b/2^{i-1}$, $c_i = d_i = b_i/2$, $\omega_i = 2\lfloor M/(9(b_i + 1)^2) \rfloor$, $\ell_i = (3/2)\omega_i$
 4:     while not reached end of band do
 5:         create next set of $\omega_i$ bulges
 6:         while not reached end of band do
 7:             write previous $\ell_i b_i$ columns of band
 8:             read next $\ell_i b_i$ columns of band
 9:             chase $\omega_i$ bulges $\ell_i$ times each
10:         end while
11:         chase $\omega_i$ bulges off the end of the band
12:     end while
13:     copy band into data structure with column height $(3/2)b_{i+1}$
14: end for
15: if $t < \log b$ then
16:     read remaining band into fast memory
17:     reduce band to tridiagonal
18:     write output to slow memory
19: end if


We omit the details of creating a set of bulges (line 5) and of chasing bulges at the end
of the band (line 11). Both the arithmetic and communication costs of creating $\omega_i$ bulges
or chasing $\omega_i$ bulges off the end of the band are dominated by that of chasing the $\omega_i$ bulges
$\ell_i$ times each. Also, since neither operation occurs in the inner loop of the algorithm, they
contribute only lower order terms to the costs of the entire algorithm.

The  computation  of  t  in  line  1  determines  the  sweep  (if  any)  after  which  the  remaining 
band  fits  entirely  in  fast  memory.  Note  that  if  n  >  M,  then  the  band  will  never  fit  in  fast 
memory  and  t  =  log  b.  If  the  band  becomes  small  enough  to  fit  in  fast  memory,  then  the 
algorithm  will  stop  the  main  loop  (lines  2-14)  and  fall  to  the  clean-up  code  in  lines  15-19 
which  simply  reads  the  band  into  fast  memory,  reduces  to  tridiagonal  form,  and  writes  the 
result  back  to  slow  memory. 

10.3.1.3 Arithmetic Cost

In  order  to  count  the  number  of  flops  required  by  Algorithm  10.1,  we  first  establish  two 
lemmas  related  to  the  cost  of  applying  Householder  transformations. 

Lemma 10.3. Applying a Householder transformation from the left, $\mathrm{House}(u) \cdot A$, costs no
more than $4hc + h - c$ flops, where $h$ is the number of nonzeros in $u$ and $A$ has $c$ columns.
Equivalently, applying the transformation from the right, $A \cdot \mathrm{House}(u)$, costs no more than
$4hr + h - r$ flops if $A$ has $r$ rows.

Proof. The first statement is verified by counting the operations in $A := A - (\tau u)(u^T A)$.
The second statement is verified by transposing the first transformation. □
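
The operation count in Lemma 10.3 can be reproduced by instrumenting a direct implementation of the update. The following is a small sketch in plain Python; the function name and the dense list-of-rows representation are assumptions made for illustration, and only the $h$ rows touched by the nonzeros of $u$ are passed in.

```python
def apply_householder_left(A, u, tau):
    """Apply A := A - (tau*u)(u^T A) to an h-by-c block A (list of rows),
    where u holds the h nonzeros of the Householder vector.
    Returns the updated block and the flop count under the usual convention
    (a length-h dot product costs h multiplications and h - 1 additions)."""
    h, c = len(A), len(A[0])
    flops = 0
    w = [tau * ui for ui in u]            # tau * u: h multiplications
    flops += h
    z = []
    for col in range(c):                  # z = u^T A, one dot product per column
        z.append(sum(u[r] * A[r][col] for r in range(h)))
        flops += 2 * h - 1
    updated = [[A[r][col] - w[r] * z[col] for col in range(c)] for r in range(h)]
    flops += 2 * h * c                    # one multiply and one subtract per entry
    return updated, flops


if __name__ == "__main__":
    # u = x + ||x|| e_1 for x = (3, 4), so the first column maps to (-5, 0).
    block = [[3.0, 1.0, 2.0], [4.0, 0.0, 1.0]]
    updated, flops = apply_householder_left(block, [8.0, 4.0], 2.0 / 80.0)
    print(updated[0][0], updated[1][0], flops)
```

With $h = 2$ and $c = 3$ the counter reports $4hc + h - c = 23$ flops, matching the bound in the lemma.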

Lemma 10.4. Applying a Householder transformation symmetrically to an n-by-n symmetric
matrix $A$, $\mathrm{House}(u) \cdot A \cdot \mathrm{House}(u)^T$, costs no more than $(4h - 1)n + 5h$ flops, where $h$ is
the number of nonzeros in $u$.

Proof. We perform three steps: $y := A(\tau u)$, $v := y - (1/2)(y^T u)u$, and $A := A - uv^T - vu^T$.
The first step costs $(2h - 1)n + h$ operations, the second $4h - 1$, and the third $2nh$, if we
exploit symmetry. □

Given  these  lemmas,  we  can  compute  the  arithmetic  cost  of  a  single  bulge  chase. 

Lemma 10.5. A single bulge chase costs $8bcd + 4cd^2 + O(bc)$ operations. Creating a bulge,
or clearing a bulge (off the end of the band), is less expensive.

Proof. We refer to the four operations depicted in Figure 10.1. Let $1 \le m \le c$ index the
(unblocked) Householder transformations that eliminate the parallelogram in the QR region.
Transformation $m$ is applied from the left to $c - m$ columns in the QR region and $b - c$ columns
in the PRE region, from the right to $b - (c - m)$ rows in the POST region, and symmetrically
to a $(d + c)$-by-$(d + c)$ symmetric matrix in the SYM region. Applying Lemmas 10.3 and
10.4, transformation $m$ performs $8bd + 4d^2 + O(b)$ flops, and there are $c$ transformations.
Creating a bulge is less expensive because the PRE region includes only $b - c - d$ columns.
As a result, transformation $m$ does fewer flops. Clearing a bulge is less expensive because
there are fewer rows in the POST region. □
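
The count in this proof can be reproduced by summing the bounds of Lemmas 10.3 and 10.4 over the $c$ transformations. The sketch below uses hypothetical function names and makes one assumption that is not stated in the text: each Householder vector has $h = d + 1$ nonzeros.

```python
def one_sided_cost(h, k):
    """Lemma 10.3 bound for a left update of k columns (or a right update of k rows)."""
    return 4 * h * k + h - k


def symmetric_cost(h, n):
    """Lemma 10.4 bound for a two-sided update of an n-by-n symmetric block."""
    return (4 * h - 1) * n + 5 * h


def bulge_chase_flops(b, c, d):
    """Sum the per-transformation bounds over m = 1..c.
    Assumption: each Householder vector has h = d + 1 nonzeros."""
    h = d + 1
    total = 0
    for m in range(1, c + 1):
        total += one_sided_cost(h, (c - m) + (b - c))  # QR and PRE regions (left)
        total += one_sided_cost(h, b - (c - m))        # POST region (right)
        total += symmetric_cost(h, d + c)              # SYM region (two-sided)
    return total


if __name__ == "__main__":
    b, c, d = 8, 4, 4
    print(bulge_chase_flops(b, c, d), 8 * b * c * d + 4 * c * d * d)
```

Under that assumption, the difference between the summed bounds and the leading term $8bcd + 4cd^2$ is $c(6b + 10d + 7)$, which is $O(bc)$ because $d \le b$.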

See  [31,  Section  5.3]  for  a  discussion  of  different  approaches  to  chasing  individual  bulges 
and  their  implications  on  performance. 

We  can  also  count  the  number  of  bulge  chases  that  occur  during  each  sweep. 

Lemma 10.6. The number of bulges chased during a sweep with parameters $n$, $b$, $c$, and $d$
is $n^2/(2bc) + O(n/b)$.


Proof. For each parallelogram eliminated, the bulge must be chased the length of the trailing
band, in increments of $b$ columns. Thus, the total number of bulge chases during a sweep is
$\sum_{j=1}^{n/c} (n - jc)/b = n^2/(2bc) + O(n/b)$. □
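
The sum in this proof is easy to evaluate for specific sizes. The following is a minimal sketch with a hypothetical function name; for simplicity it assumes that $c$ divides $n$ and applies no rounding to the per-parallelogram chase counts.

```python
def bulge_chases_in_sweep(n, b, c):
    """Sum over parallelograms j = 1..n/c of (n - j*c)/b, as in the proof of
    Lemma 10.6 (assumes c divides n; fractional chase counts are not rounded)."""
    return sum((n - j * c) / b for j in range(1, n // c + 1))


if __name__ == "__main__":
    n, b, c = 1024, 8, 4
    print(bulge_chases_in_sweep(n, b, c), n * n / (2 * b * c))
```

For $n = 1024$, $b = 8$, $c = 4$, the sum differs from the leading term $n^2/(2bc)$ by $n/(2b)$, which is within the stated $O(n/b)$.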

Lemmas 10.5 and 10.6 together imply the following fact, which agrees with [38, Equation (3)].

Lemma 10.7. The arithmetic cost of eliminating $d$ diagonals from a matrix with bandwidth
$b$ using SBR is $(4d + 2d^2/b)n^2 + O(n^2)$.

The order of operations specified by the algorithm does not affect the arithmetic count,
provided Constraints 1 and 2 are satisfied. Given the cost of the $i$th sweep specified by
Lemma 10.7, since $d_i = b_i/2$ and $\sum_i d_i = b - 1$, the arithmetic cost of Algorithm 10.1
(ignoring lower order terms) is

$\sum_{i=1}^{\log b} \left(4d_i + \frac{2d_i^2}{b_i}\right) n^2 = 5n^2 \sum_{i=1}^{\log b} d_i = 5n^2 b.$
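
The leading-order total can be confirmed by summing the per-sweep cost of Lemma 10.7 over the halving schedule. The sketch below uses a hypothetical function name, assumes $b$ is a power of two, and drops the $O(n^2)$ term of each sweep.

```python
def casbr_flops_leading(n, b):
    """Leading-order flops: sum over sweeps of (4*d_i + 2*d_i^2/b_i) * n^2,
    with d_i = b_i / 2 and b a power of two."""
    total = 0
    b_i = b
    while b_i > 1:
        d_i = b_i // 2
        # 2*d_i^2/b_i equals d_i exactly when b_i = 2*d_i, so integer division is exact.
        total += (4 * d_i + 2 * d_i * d_i // b_i) * n * n
        b_i //= 2
    return total


if __name__ == "__main__":
    n, b = 10, 8
    print(casbr_flops_leading(n, b), 5 * (b - 1) * n * n)
```

The sum evaluates to exactly $5(b - 1)n^2$, which is $5n^2 b$ up to lower order terms.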


10.3.1.4  Bandwidth  Cost 

In determining the communication costs of Algorithm 10.1, we must consider two cases. If
$n > M$, then $\log b \le \lceil \log(n(b + 1)/M) \rceil$ and the main loop (lines 2-14) will be executed
$\log b$ times, reducing the band to tridiagonal. However, if $n < M$, then at some point the
bandwidth will become small enough such that the entire band fits in fast memory. At this
point, the algorithm reduces to lines 15-19 and the only communication required to finish
the reduction is that of reading the band into fast memory and writing the tridiagonal output
back to slow memory for a cost of $O(nb_{t+1})$ words.

We now consider the $i$th sweep, where we assume the band is too large to fit in fast
memory. The dominant communication cost is in the innermost loop (lines 6-10). The
number of words in each column is $(3/2)b_i$, so the bandwidth cost of one iteration of the
inner loop is $3\ell_i b_i^2 = O(M)$ words. The inner loop is executed $O(n/(\ell_i b_i))$ times for each set
of bulges, and there are $O(n/(c_i \omega_i))$ sets of bulges during the sweep. Thus, the bandwidth
cost of one sweep is $O(n^2 b_i^2/M)$ words.

The  bandwidth  cost  (he.,  number  of  words  moved)  of  Algorithm  10.1  is  then 

10.3.1.5  Latency  Cost 

We  will  assume  the  band  matrix  is  stored  in  LAPACK  symmetric  band  storage  format 
(column-major  with  column  height  equal  to  the  working  bandwidth)  so  that  any  block  of 
columns of the band will be stored contiguously in slow memory. After each set of
subdiagonals is annihilated from a column block, the algorithm packs the remaining diagonals
into a smaller data structure (see line 13) to maintain a packed column-major layout for all
successive sweeps. This increases the memory footprint by no more than a factor of two and
can also be done in place, and it adds only lower order terms to the bandwidth and latency
costs.

As in the previous section, if the band becomes small enough to fit in fast memory, then
the communication costs of completing the algorithm are reduced to reading the band and
writing the tridiagonal output. In this case, the latency cost is 2 messages. When the band is
too large to fit in fast memory, the dominant latency cost is that of the innermost loop. Since
consecutive columns are stored contiguously, the latency cost per iteration of the innermost
loop is 2 messages. As argued above, the inner loop is executed $O(n/(\ell_i b_i))$ times for each
set of bulges, and there are $O(n/(c_i \omega_i))$ sets of bulges during the sweep. Thus, the latency
cost of one sweep is $O(n^2 b_i^2/M^2)$ messages.

The latency cost (i.e., number of messages moved) of Algorithm 10.1 is then

$\sum_{i=1}^{t} O(n^2 b_i^2/M^2) + O(1) = O(n^2 b^2/M^2).$


10.3.2  Computing  Eigenvalues  and  Eigenvectors 

When  only  eigenvalues  are  desired,  the  orthogonal  similarity  transformations  that  reduce  the 
band  matrix  to  tridiagonal  form  may  be  discarded.  However,  when  eigenvectors  are  desired, 
these  transformations  must  be  used  to  reconstruct  the  eigenvectors  QV  of  the  band  matrix 
from  the  eigenvectors  V  of  the  tridiagonal  matrix. 

Compared to Section 10.3.1, the main difference when eigenvectors are also computed
is that the arithmetic cost of computing QV increases with the
number of sweeps taken in the band reduction. While the arithmetic cost of the band
reduction for the algorithms discussed in Section 10.3.1 ranges from $4n^2 b$ to $6n^2 b$, that of
the back-transformation ranges from $2n^3$ up to $n^3 \log b$.

The orthogonal matrix Q can be constructed explicitly by applying the updates from the
band reduction to an n-by-n identity matrix. Some flops may be saved when starting from
the identity matrix (compared to applying them to a dense matrix, see e.g., [97]), but the
entries fill in quickly after one sweep, and we will ignore these savings in our analysis. Then,
the arithmetic cost of computing QV given V is the cost of a matrix multiplication, $2n^3$ flops.
However, the cost of this matrix multiplication can be avoided by storing Q implicitly as a
set of Householder vectors and applying them to V. While this choice does not affect our
theoretical analysis of CASBR, it should be considered in practice. Storing the Householder
information for each sweep requires extra memory for at most $n^2/2$ entries per sweep.
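
The equivalence of the explicit and implicit routes can be illustrated with a small pure-Python
sketch. It uses dense, randomly generated Householder vectors rather than the short vectors
produced by a band reduction, and all names (`apply_householder`, `matmul`, the sizes `n` and `k`)
are hypothetical choices made for this illustration.

```python
import random


def apply_householder(v, X):
    """Overwrite X (a list of rows) with (I - 2 v v^T / (v^T v)) X."""
    vtv = sum(x * x for x in v)
    for col in range(len(X[0])):
        s = 2.0 * sum(v[r] * X[r][col] for r in range(len(v))) / vtv
        for r in range(len(v)):
            X[r][col] -= s * v[r]


def matmul(A, B):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


random.seed(0)
n, k = 6, 4
vectors = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
V = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

# Explicit route: form Q = H_1 H_2 ... H_k starting from the identity, then multiply.
Q = [[float(i == j) for j in range(n)] for i in range(n)]
for v in reversed(vectors):
    apply_householder(v, Q)
explicit = matmul(Q, V)

# Implicit route: apply the stored reflectors directly to a copy of V.
implicit = [row[:] for row in V]
for v in reversed(vectors):
    apply_householder(v, implicit)

max_diff = max(abs(x - y) for r1, r2 in zip(explicit, implicit) for x, y in zip(r1, r2))
```

Both routes produce the same product up to rounding; the implicit route never forms Q and so
skips the final matrix multiplication.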

Table  10.2  shows  the  computation  and  communication  costs  for  various  approaches  to 
tridiagonalizing  a  band  matrix  (for  computing  both  eigenvalues  and  eigenvectors). 

Recall that one context of this work is two-step tridiagonalization; the communication
lower bounds referenced in Section 10.1.4 apply to the first step (full-to-banded), but not the
second step. However, note that a lower bound for part of the algorithm gives a valid lower
bound for the whole algorithm. So, we will compare the approaches in Table 10.2 (the second
step) and see which attain the lower bounds of $\Omega(n^3/\sqrt{M})$ words moved and $\Omega(n^3/M^{3/2})$
messages sent; both bounds are attainable by the first step by setting $b = O(\sqrt{M})$. We
claim that only CASBR attains these expected lower bounds for all ranges of parameters we
consider, within a factor of $t = O(\log M)$.

Algorithm         Flops     Words                                  Messages
LAPACK [97]       $2n^3$    $O(n^2 b + n^3)$                       $O(n^2 b + n^3/M)$
Improved LAPACK   $2n^3$    $O(n^2 b + n^3/\sqrt{M})$              $O(n^2 b + n^3/M)$
BLS [41]          $2n^3$    $O(n^2 b + n^3/\sqrt{M})$              $O(n^2 b/M + n^3/M)$
Improved BLS      $2n^3$    $O(n^2 b^3/M + n^3/\sqrt{M})$          $O(n^2 b^3/M^2 + n^3/M^{3/2})$
CASBR             $tn^3$    $O(n^2 b/\sqrt{M} + tn^3/\sqrt{M})$    $O(tn^2/M + tn^3/M^{3/2})$

Table 10.2: Asymptotic comparison of previous sequential algorithms for tridiagonalization
(for eigenvalues and eigenvectors) with our improvements, for symmetric band matrices of $n$
columns and $b + 1$ subdiagonals on a machine with fast memory of size $M$. We include the
cost of the back transformation (but not the cost of the tridiagonal eigendecomposition). The
table assumes that $nb \gg M$ and that $2 \le b \le \sqrt{M}/3$. The two terms in the communication
costs correspond to the band reduction and back transformation, respectively. In the last
row, $t = O(\min\{\log b, \log(nb/M)\})$.

Clearly the costs of LAPACK asymptotically exceed the lower bounds. If $n \ll M$, the
bandwidth costs of the band reduction for Improved LAPACK, BLS, and Improved BLS
asymptotically exceed the lower bound. If $n \gg M$, then the bandwidth costs of those three
approaches match the lower bound, and the latency cost of Improved BLS also matches the
lower bound.
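
To make the comparison concrete, the following sketch evaluates the leading terms of Table 10.2
with all constants dropped and divides them by the two lower bounds. The formulas are
transcribed from the table as we read it, the function names are ours, and the resulting ratios
are only indicative because the hidden constants are ignored.

```python
import math


def table_10_2_costs(n, b, M):
    """Leading terms (constants dropped) of (words, messages) for each row of Table 10.2."""
    r = math.sqrt(M)
    t = min(math.log2(b), math.log2(n * b / M))   # t = O(min{log b, log(nb/M)})
    return {
        "LAPACK": (n * n * b + n**3, n * n * b + n**3 / M),
        "Improved LAPACK": (n * n * b + n**3 / r, n * n * b + n**3 / M),
        "BLS": (n * n * b + n**3 / r, n * n * b / M + n**3 / M),
        "Improved BLS": (n * n * b**3 / M + n**3 / r,
                         n * n * b**3 / M**2 + n**3 / M**1.5),
        "CASBR": (n * n * b / r + t * n**3 / r, t * n * n / M + t * n**3 / M**1.5),
    }


def ratios_to_lower_bounds(n, b, M):
    """Divide each cost by the lower bounds n^3/sqrt(M) (words) and n^3/M^(3/2) (messages)."""
    words_lb, msgs_lb = n**3 / math.sqrt(M), n**3 / M**1.5
    return {name: (w / words_lb, m / msgs_lb)
            for name, (w, m) in table_10_2_costs(n, b, M).items()}


large_n = ratios_to_lower_bounds(2**20, 64, 2**16)    # n much larger than M
small_n = ratios_to_lower_bounds(2**14, 256, 2**20)   # n much smaller than M
```

In the first parameter set the word counts of Improved LAPACK, BLS, and Improved BLS are within
a small constant of the lower bound, while in the second the $n^2 b$ term of Improved LAPACK and
BLS dominates; the CASBR ratios stay close to $t$ in both.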

Fact 1. The computational cost of applying all the updates from a single band reduction
sweep to a dense n-by-n matrix is $2(d/b)n^3$, ignoring lower order terms.

Proof. From Lemma 10.6, there are $n^2/(2bc)$ bulge chases, each consisting of $c$ Householder
vectors of length $d+1$. From Lemma 10.3, the cost of applying each Householder transformation
to an n-by-n matrix is $4(d+1)n$, so the total arithmetic cost is $4dn \cdot (n^2/(2b)) = 2(d/b)n^3$,
ignoring lower order terms. □

10.3.2.1  Alternative  Algorithms 

As mentioned in Section 10.3.1.1, the current LAPACK reference code for band reduction
(sbtrd) is based on [97]. When eigenvectors are requested, Q can be either explicitly formed
or applied to an input matrix. The LAPACK routine for solving the eigenproblem for a
band matrix (sbevd) forms Q explicitly and premultiplies V by it. The arithmetic cost of
forming Q is approximately $(4/3)n^3$ [97], and the cost of the matrix multiplication is $2n^3$.
In Table 10.2, we do not count the cost of computing Q, because the Givens rotations can
be stored, reordered, and later applied to V for a total of $2n^3$ flops, although LAPACK does
not offer this functionality.

The communication cost of the band reduction is analyzed in Section 10.3.1.1. Assuming
the stored Givens rotations are applied to the rows of V one at a time (which is how they are
accumulated in Q in LAPACK), at least one of the rows must be read from slow memory,
and the data reuse is $O(1)$. This implies that the bandwidth cost of the band reduction,
which is $O(n^2 b)$, is dominated by the cost of the orthogonal updates. In the best case, if V
is stored in row-major order and $n > M$, the latency cost is $O(n^3/M)$.

In [41], the authors consider an alternative approach for computing both eigenvalues
and eigenvectors of a band matrix, in the context of a 2-step reduction of a full symmetric
matrix. The band reduction scheme follows the algorithm of [115] consisting of one sweep
(i.e., $d = b - 1$ and $c = 1$). The key idea from [41] is to store all of the Householder vectors
and, instead of applying them to V in exactly the reverse order that they were computed,
to use a reordering technique that respects the dependency pattern. This reordering allows
for the orthogonal updates to be blocked. See Figure 2 in [41] or Figure 2 in [11] for
illustrations of this technique. Since the band reduction is performed in one sweep, the
arithmetic cost is $2n^3$. Using the reordering technique with a blocking factor of size $\Theta(\sqrt{M})$,
the communication cost of the orthogonal updates is $O(n^3/\sqrt{M})$. While the orthogonal
updates are performed efficiently, the data reuse obtained during the band reduction is
$O(1)$, as explained in Section 10.3.1.1. Thus, the bandwidth cost of the band reduction is
$O(n^2 b)$, which dominates the total bandwidth cost for $b \gg n/\sqrt{M}$. In the best case, the
latency cost of the band reduction is $O(n^2 b/M)$. In order to determine the latency cost of
the orthogonal updates, we assume the matrix V is stored in column-major order and the
Householder vectors are written to memory in the order they are computed. Then every
application of a block of Householder vectors involves $O(\sqrt{M})$ messages, and so the latency
cost is a factor of $O(\sqrt{M})$ less than the bandwidth cost. We refer to this as BLS, given in
the third row of Table 10.2.

Note that this same reordering optimization from [41] can be used to improve the
LAPACK algorithm. That is, the Givens rotations may be reordered and applied to V in a
blocked fashion. For examples of implementations for applying blocks of Givens rotations,
see [123, 147]. If the right block size is chosen, the bandwidth cost of the orthogonal updates
can be reduced to $O(n^3/\sqrt{M})$. We refer to this algorithm as “Improved LAPACK,” given in
the second row of Table 10.2. Because of better alternatives, we do not discuss improvements
in the latency cost.

We can apply two optimizations to reduce the communication costs of the BLS
algorithm. First, as noted in Section 10.3.1.1, when $b \ll \sqrt{M}$, the communication costs of the
band reduction can be improved by chasing $O(M/b^2)$ bulges at a time, reducing both the
bandwidth and latency costs by a factor of $O(M/b^2)$.

Second, we can reduce the latency cost in performing the orthogonal updates by storing
the eigenvector matrix V in a block-contiguous layout with block size C-by-C with $C = O(\sqrt{M})$
and by performing a data layout transformation of the temporary data structure
of Householder vectors. In order to minimize bandwidth cost, the Householder vectors
corresponding to eliminating $O(\sqrt{M})$ columns and chasing their respective bulges off the
band should be temporarily stored before applying them to Q.

In order to analyze the data layout transformation, we need to consider the temporary
storage of Householder vectors. If we let H be the temporary storage matrix, then we can
store each Householder vector associated with the same eliminated column of A in the same
column of H. Further, each vector can occupy the rows of H corresponding to the rows of
A it updated; in this way, H is an n-by-n lower triangular matrix. If one bulge is chased at
a time and Householder vectors are written to H in the order they are computed, then H
will have a column-major data layout. However, in order to improve data reuse in applying
the vectors to V, we want to apply parallelograms of vectors at a time, so we need those
parallelograms to be stored contiguously. The data layout transformation is equivalent to
transforming a matrix in column-major layout to a block-contiguous layout. By applying
(for example) the Separate function given as Algorithm 3 in [32] to each panel of width
$\Theta(\sqrt{M})$ a logarithmic number of times, we can convert H from column-major to
$\Theta(\sqrt{M})$-by-$\Theta(\sqrt{M})$ block-contiguous layout with total bandwidth cost $O(n^2 \log(n/\sqrt{M}))$ and total
latency cost $O((n^2/M) \log(n/\sqrt{M}))$, which are lower order terms for $n \gg \sqrt{M}$.
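
The target layout can be illustrated with a short sketch. The code below is a direct out-of-place
index permutation on a full square matrix, not the communication-efficient Separate procedure of
[32], and it ignores the triangular and parallelogram structure of H; the function name and the
ordering of blocks are assumptions made for this example.

```python
def to_block_contiguous(a, n, blk):
    """Permute a flat column-major n-by-n array so each blk-by-blk block is contiguous.

    Blocks are ordered by block column, then block row; entries inside a block
    are column-major. Assumes blk divides n.
    """
    assert n % blk == 0
    blocks_per_dim = n // blk
    out = [None] * (n * n)
    for col in range(n):
        for row in range(n):
            block_row, block_col = row // blk, col // blk
            in_row, in_col = row % blk, col % blk
            block_start = (block_col * blocks_per_dim + block_row) * blk * blk
            out[block_start + in_col * blk + in_row] = a[col * n + row]
    return out


# Entry (row, col) of a 4-by-4 matrix stored column-major holds the value col*4 + row.
flat = list(range(16))
blocked = to_block_contiguous(flat, 4, 2)
```

After the permutation, each block occupies `blk * blk` consecutive words, so reading one block
costs a single message instead of one message per column segment.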

Note that these two optimizations cannot both be applied straightforwardly to the
approach of [41], as H will not be written in column-major order when multiple bulges are
chased at a time. We claim that a more complicated data layout transformation is possible
in the case that multiple bulges are chased at a time. The costs of this algorithm are given
as “Improved BLS” in the fourth row of Table 10.2. We also claim it is possible to apply
the second optimization to the LAPACK algorithm, though the order in which the Givens
rotations are computed and the method for temporarily storing them are more complicated.

10.3.2.2  CASBR 

Algorithm 10.2 is a modification of Algorithm 10.1 which includes the explicit formation of
the matrix Q, which we store in a block-contiguous layout with C-by-C blocks. An important
difference between the two algorithms is the definition of $\omega_i$, the number of bulges chased
at a time. In Algorithm 10.1, $\omega_i$ is maximized under the constraint that the working set
of data to chase $\omega_i$ bulges $\ell_i$ times each remains of size $O(M)$. In Algorithm 10.2, $\omega_i$ is
further limited so that the working set of data while applying the Householder updates to
a block row of the intermediate Q matrix remains of size $O(M)$. This working set of data
now consists of three components: a subset of A, Householder transformations (temporarily
stored in a data structure H), and blocks of Q. We will pick $\omega_i$ to be approximately the
square root of the previous choice so that each of these three components occupies no more
than a third of fast memory. Reducing $\omega_i$ results in more communication cost during the
band reduction, but we will see that this cost is always dominated by that of the orthogonal
updates. One advantage of this approach is that, assuming the band is too large to fit into
fast memory, Householder information is never written to slow memory: it is computed in
fast memory, all updates are applied, and then the Householder entries are discarded.

In order to validate the communication pattern described in Algorithm 10.2, we verify
three facts: $2\ell_i b_i$ columns of A fit in one third of fast memory, H fits in one third of
fast memory, and each iteration in the OrthogonalUpdates function involves at most 3
blocks of Q, which fit in one third of fast memory. We will show that this is possible when
$\omega_i \le 2\sqrt{M}/(9(b_i + 1))$, $\ell_i = (3/2)\omega_i$, $C = \sqrt{M}/3$, and the assumption from above that
$b \le \sqrt{M}/3$.

Since each column of the band has at most $(3/2)b_i + 1$ entries, the total number of words in
$2\ell_i b_i$ columns is $\omega_i((9/2)b_i^2 + 3b_i) \le M/3$. The H data structure needs to store Householder
information corresponding to chasing $\omega_i$ bulges $\ell_i$ times each, and each bulge consists of
$c_i(d_i + 1)$ entries. Thus H occupies $(3/8)\omega_i^2(b_i^2 + 2b_i) \le M/3$ words. Finally, we must also
verify that the columns of Q updated by the bulge chases (which correspond
to the rows of the band that are updated) cannot span more than 3 blocks of Q (i.e., one third
of fast memory). By Lemma 10.1, the number of columns is $(3/2)\omega_i(b_i + 1) - b_i/2 \le 2\sqrt{M}/3$;
since $C = \sqrt{M}/3$, these columns cannot span more than 3 blocks.
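
These three bounds are easy to check numerically. The sketch below plugs the largest admissible
value of $\omega_i$ into the three expressions above, treating all quantities as real numbers (that is,
ignoring the rounding needed to make $\omega_i$ and $\ell_i$ integers); the function name is ours.

```python
import math


def working_set(M, b_i):
    """Return (band words, H words, updated columns of Q) for the largest allowed omega_i."""
    omega = 2 * math.sqrt(M) / (9 * (b_i + 1))   # upper limit on omega_i
    ell = 1.5 * omega                            # ell_i = (3/2) omega_i
    c_i = d_i = b_i / 2
    band_words = 2 * ell * b_i * (1.5 * b_i + 1)   # 2*ell_i*b_i columns of the band
    h_words = omega * ell * c_i * (d_i + 1)        # omega_i bulges chased ell_i times
    q_columns = 1.5 * omega * (b_i + 1) - b_i / 2  # columns of Q touched (Lemma 10.1)
    return band_words, h_words, q_columns


def fits_in_fast_memory(M, b_i):
    band_words, h_words, q_columns = working_set(M, b_i)
    return (band_words <= M / 3 and h_words <= M / 3
            and q_columns <= 2 * math.sqrt(M) / 3)
```

For every power of two $b_i \le \sqrt{M}/3$ we tried, all three conditions hold with room to spare,
which matches the inequalities stated in the text.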

Note  that  t  is  defined  differently  here  than  in  Section  10.3.1.2.  Here,  since  we  will 
eliminate  all  subdiagonals  at  once,  we  need  twice  the  working  bandwidth  to  fit  into  fast 
memory. 


10.3.2.3  Arithmetic  Cost 

From Fact 1, the arithmetic cost of the orthogonal updates is given by $2n^3 \sum_{i=1}^{t} d_i/b_i$,
where $t$ is the number of sweeps, and the cost of the band reduction is always a lower
order term. By the definition of $t$ and the fact that $d_i = b_i/2$, the arithmetic cost is then
$n^3 \min\{\log b, \lceil \log(2n(b + 1)/M) \rceil\}$, ignoring lower order terms.
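
A small sketch of this count follows. It assumes base-2 logarithms and a power-of-2 bandwidth,
and it counts only the $t$ sweeps of the main loop; the helper names are ours.

```python
import math


def casbr_sweeps(n, b, M):
    """t = min{log b, ceil(log(2n(b+1)/M))} with base-2 logs; b must be a power of 2."""
    log_b = b.bit_length() - 1
    return min(log_b, math.ceil(math.log2(2 * n * (b + 1) / M)))


def orthogonal_update_flops(n, b, M):
    """Sum the per-sweep cost 2(d_i/b_i)n^3 from Fact 1 with d_i = b_i/2."""
    total, b_i = 0, b
    for _ in range(casbr_sweeps(n, b, M)):
        d_i = b_i // 2
        total += 2 * d_i * n**3 // b_i   # each sweep contributes n^3
        b_i //= 2
    return total
```

Because every sweep halves the bandwidth, each one costs $n^3$ flops for the orthogonal updates, and
the total is $t n^3$.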


10.3.2.4  Bandwidth  Cost 


The bandwidth cost can be computed in a similar way to Section 10.3.1.2, though $\omega_i$ is
defined slightly differently. The dominant communication cost is the call to the function
OrthogonalUpdates within the innermost loop (lines 7-12). During the $i$th sweep, the
number of sets of $\omega_i$ bulges is $n/(c_i \omega_i)$, and for each set, the innermost loop is executed
$O(n/(\ell_i b_i))$ times. Since H resides in fast memory, the bandwidth cost of the function
OrthogonalUpdates is that of reading and writing the row panels of the Q matrix:
$O(nC)$ words. Thus, the total bandwidth cost of Algorithm 10.2 is

$\sum_{i=1}^{t} \frac{n}{c_i \omega_i} \cdot O\left(\frac{n}{\ell_i b_i}\right) \cdot O(nC) = O\left(\frac{t n^3}{\sqrt{M}}\right).$


Note that due to the change in definition of $\omega_i$, the bandwidth cost of the band reduction
is increased from $O(n^2 b^2/M)$ (from Section 10.3.1.2) to $O(n^2 b/\sqrt{M})$, but this higher cost is
still dominated by that of the orthogonal updates.
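
The per-sweep word count of the orthogonal updates can be sketched as follows. The model uses
the parameter choices $\omega_i = 2\sqrt{M}/(9(b_i+1))$, $\ell_i = (3/2)\omega_i$, and $C = \sqrt{M}/3$, treats them as
real numbers, and charges each call to OrthogonalUpdates for reading and writing 3 blocks in
each of the $n/C$ block columns. The explicit constants are an artifact of this simplified model
rather than a claim about the algorithm.

```python
import math


def sweep_update_words(n, b_i, M):
    """Words moved by the orthogonal updates during one sweep, under the model above."""
    C = math.sqrt(M) / 3
    omega = 2 * math.sqrt(M) / (9 * (b_i + 1))
    ell = 1.5 * omega
    c_i = b_i / 2
    sets_of_bulges = n / (c_i * omega)          # n/(c_i omega_i)
    inner_iterations = n / (ell * b_i)          # n/(ell_i b_i)
    words_per_call = 2 * 3 * C * C * (n / C)    # read and write 3 blocks per block column
    return sets_of_bulges * inner_iterations * words_per_call


def ratio_to_n3_over_sqrt_m(n, b_i, M):
    return sweep_update_words(n, b_i, M) / (n**3 / math.sqrt(M))
```

In this model the ratio works out to $54((b_i+1)/b_i)^2$, independent of $n$ and $M$, so each sweep
moves $O(n^3/\sqrt{M})$ words and $t$ sweeps give the $O(tn^3/\sqrt{M})$ total.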

In the case that $\lceil \log(2n(b + 1)/M) \rceil < \log b$, the final step of the algorithm is to read
the entire band into memory and reduce all the remaining subdiagonals at once, updating Q
using the second technique of improving the BLS algorithm (i.e., transforming the
column-major H matrix to block-contiguous layout). In this case, the bandwidth cost of reading A
is $O(nb)$, and the cost of the orthogonal updates is $O(n^3/\sqrt{M})$ as explained above. Both of
these are lower order terms.

Algorithm 10.2 Sequential CASBR with orthogonal updates

Require: initial bandwidth $b \le \sqrt{M}/3$ is a power of 2, $Q = I_n$ is stored in contiguous
C-by-C blocks, H is a temporary data structure of size $O(M)$ which resides in fast
memory

1: $t = \min\{\log b, \lceil \log(2n(b + 1)/M) \rceil\}$
2: for i = 1 to t do
3:     $b_i = b/2^{i-1}$, $c_i = b_i/2$, $d_i = b_i/2$, $\omega_i = 2\sqrt{M}/(9(b_i + 1))$, $\ell_i = (3/2)\omega_i$
4:     while not reached end of band do
5:         create next set of $\omega_i$ bulges, storing Householder entries in H
6:         OrthogonalUpdates(Q, H)
7:         while not reached end of band do
8:             write previous $\ell_i b_i$ columns of band
9:             read next $\ell_i b_i$ columns of band
10:            chase $\omega_i$ bulges $\ell_i$ times each, storing Householder entries in H
11:            OrthogonalUpdates(Q, H)
12:        end while
13:        chase $\omega_i$ bulges off the end of the band, storing Householder entries in H
14:        OrthogonalUpdates(Q, H)
15:    end while
16:    copy band into data structure with column height $(3/2) b_{i+1}$
17: end for
18: if t < log b then
19:     read remaining band into fast memory
20:     reduce band to tridiagonal in one sweep, updating Q with improved BLS algorithm
21:     write output to slow memory
22: end if
23: function OrthogonalUpdates(Q, H)
24:     for i = 1 to n/C do
25:         read at most 3 blocks from ith block column of Q into fast memory
26:         apply Householder updates stored in H to blocks of Q
27:         write blocks of Q back to slow memory
28:     end for
29: end function

10.3.2.5  Latency  Cost 

The latency cost is also dominated by that of the orthogonal updates. Since Q is stored
in C-by-C contiguous blocks, the latency cost of the function OrthogonalUpdates is
$O(n/C)$. Thus, the latency cost of Algorithm 10.2 simplifies to $O(tn^3/M^{3/2})$.

Like the bandwidth cost, the latency cost associated with the band reduction is increased
by the choice of $\omega_i$, but this higher cost of $O(tn^2/M)$ is still dominated by that of the
orthogonal updates. In the case that $\lceil \log(nb/M) \rceil < \log b$, the final step of the algorithm
using the improved BLS technique incurs a latency cost which is also a lower order term.


10.4  Parallel  Band  Tridiagonalization  Algorithms 

Recall  our  distributed-memory  parallel  model  described  in  Section  2.2.2,  where  we  have 
P  processors  connected  over  a  network.  Again,  we  will  first  discuss  the  case  of  computing 
eigenvalues  only  and  then  extend  to  the  case  of  computing  both  eigenvalues  and  eigenvectors. 
The  main  improvement  of  our  new  algorithm  over  previous  approaches  is  a  reduction  in 
latency  cost,  both  in  terms  of  the  band  reduction  and  the  back-transformation  phase  (when 
eigenvectors  are  desired). 

We assume that $b \le n/(3P)$, where P is the number of processors involved in the band
reduction. This is a reasonable assumption in the context of two-step tridiagonalization, in
order to minimize the latency cost in the first step. For larger bandwidths, one may use
fewer processors on the first sweep(s), or have multiple processors participate in a single
bulge chase. The latter approach may incur a higher communication cost; see [102].

10.4.1  Computing  Eigenvalues  Only 

In  this  section  we  concern  ourselves  with  the  case  when  only  eigenvalues  are  desired,  so  the 
orthogonal  updates  may  be  discarded  after  applying  them  to  the  band.  We  collect  the  results 
from  the  analyses  in  Sections  10.4.1.1-10.4.1.2  in  Table  10.3. 

10.4.1.1  Alternate  Approaches 

The  ‘conventional’  distributed  memory  band  tridiagonalization  algorithm  was  introduced  in 
[102],  and  has  been  extended  several  times.  This  is  a  parallelization  of  the  MH  algorithm, 
discussed  in  Section  10.3.1.1,  a  one-sweep  band  reduction  algorithm  (i.e.,  d  =  b  —  1  and 
$c = 1$). We will refer to this as Lang’s algorithm.

Algorithm        Flops           Words      Messages
Lang [10, 102]   $O(n^2 b/P)$    $O(nb)$    $O(n)$
CASBR            $O(n^2 b/P)$    $O(nb)$    $O(P \log b)$

Table 10.3: Asymptotic comparison of previous parallel algorithms for tridiagonalization (for
eigenvalues only) with our improvements, for symmetric band matrices of $n$ columns and
$b + 1$ subdiagonals on a machine with P processors. The first row assumes that $P \le n/b$,
and the second row assumes $P \le n/(3b)$. The asymptotic arithmetic and communication
costs are determined along the critical path.


We  will  not  present  this  algorithm  and  its  variants,  but  instead  refer  the  reader  to  the 
detailed  complexity  analysis  (and  performance  modeling)  in  [10]  (summarized  in  the  papers 
[11]  and  [12]).  We  present  their  complexity  results  in  asymptotic  notation;  the  hidden 
constant  factors  vary  depending  on  the  optimizations  applied,  including  ‘logical  blocking,’ 
which  eliminates  a  factor  of  2  idle  time  along  the  critical  path,  and  using  a  cyclic  layout,  which 
helps  alleviate  load  imbalance  between  processors.  Along  the  critical  path,  their  algorithm 
performs $O(n^2 b/P)$ flops and moves $O(nb)$ words. Because there is a communication step
for every column in the band, the latency cost is $O(n)$ messages.

Unless multiple bulges are chased at a time, the latency cost of $O(n)$ cannot be
asymptotically reduced. That is, if a message is sent along the critical path for every parallelogram
annihilated, then the last sweep, which has one parallelogram for each column, will incur
$O(n)$ latency cost.

10.4.1.2  CASBR 

The parallel CASBR algorithm begins with a data layout similar to that of Lang’s algorithm. Each
of the P processors (indexed 0 to $P - 1$) owns a contiguous set of $C = n/P$ columns of the
lower half of the symmetric band. We use a similar successive halving and multiple bulge
chasing approach to the sequential CASBR algorithm. During each sweep, the number of
columns per processor stays fixed at $C = n/P$. We assume each of the P processors has
$\Omega(nb/P)$ words of memory available, so that the band A can be stored across the machine.
To simplify the presentation, we assume that $3bP$ divides $n$, and that $b$ is a power of 2.
This implies that $P \le n/(3b)$, which is our maximum parallelism. Note that our maximum
parallelism is three times smaller than that of the Lang approach; we conjecture that this constant
factor can be improved by exploiting more overlap between the pipeline stages in the bulge
chasing procedure. These constant factors will not affect our asymptotic analysis.

Roughly, the parallel algorithm proceeds as each processor chases bulges through its C
(local) columns and into the C columns of its right neighbor, and then passes the second set
of columns to its right neighbor. This way, each of the P processors accesses only $O(nb/P)$
words of A rather than streaming through the entire band. In the algorithm we present below,
each processor is active on every other step; we can eliminate this idle time by using logical
blocking (as in [102]), but we ignore this factor of 2 savings for the purposes of our asymptotic
analysis.

At a high level, there are four kernels: create_bulges, pass_bulges, clear_bulges, and
create_and_clear_bulges. The create_bulges kernel eliminates $\omega_i$ parallelograms (each with $c_i$
columns and $d_i$ diagonals) from the local set of C columns of A and chases the resulting
bulges $\ell_i$ times (on average³) into the right neighbor’s set of C columns. The pass_bulges
kernel chases $\omega_i$ bulges (created by the left neighbor) from the local set $\ell_i$ times into the
right neighbor’s set. The create_and_clear_bulges and clear_bulges kernels are only executed
by the last processor⁴ and are analogous to create_bulges and pass_bulges, except the ‘second
set of columns’ is off the end of the band. Both create_bulges and pass_bulges require $2C$
columns to pass information from one processor’s columns to the next: the left set of C
columns is owned by the processor invoking the kernel, and the right set is owned by the
right neighbor. The create_and_clear_bulges and clear_bulges kernels require only the last C
columns of the band (the last processor’s local set).

At any time, a processor will have access to and update only its own C columns and the
C columns from its right neighbor. For example, the parallel algorithm begins with processor
1 sending its columns to processor 0. After processor 0 executes the create_bulges kernel, it
sends the updated second set of C columns (with bulges) back to processor 1. Processor 1
must then also receive processor 2’s C columns in order to execute the pass_bulges kernel.
The parallel algorithm ends (on sweep $i = \log b$) with processor $P - 1$ receiving C columns
from the left, clearing all bulges, and finally eliminating the last subdiagonal of its local
block (via create_and_clear_bulges).

In order for the pass_bulges kernel to pass the bulges into the right neighbor’s column
block, and for the bulges to retain their respective positions relative to the column blocks,
we set $\ell_i = C/b_i$, which is an integer given the assumptions above. Recall that a bulge chase
advances a bulge exactly $b_i$ columns.

Our constraint on $\omega_i$, the maximum number of bulges that fits in $C = n/P$ columns, is
given by Lemma 10.2:

$(\omega_i - 1)(2b_i - c_i) + c_i + d_i \le C.$

A little more care must be taken when creating bulges to ensure that they do not cross
processor boundaries (adjacent sets of C columns). Consulting Lemma 10.1, for the successive
halving approach, we arrive at the following lemma.

Lemma 10.8. Assuming $b$ and $\omega$ are even, $c = d = b/2$, and $3b$ divides $C$, then we can create
and chase $\omega = 2C/(3b)$ bulges at a time, and chasing them $\ell = C/b$ times each advances
them to the next set of C columns.
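
The parameter choices of Lemma 10.8 can be checked against the column constraint with a few
lines of Python. The constraint is taken in the form $(\omega - 1)(2b - c) + c + d \le C$, which is our
reading of the inequality attributed to Lemma 10.2 and should be treated as an assumption of this
sketch, as should the function name.

```python
def lemma_10_8_parameters(C, b):
    """Return (omega, ell, columns occupied by omega bulges) for c = d = b/2.

    Requires b even and 3b dividing C, as in the hypotheses of Lemma 10.8.
    """
    assert b % 2 == 0 and C % (3 * b) == 0
    c = d = b // 2
    omega = 2 * C // (3 * b)                         # number of bulges chased at a time
    ell = C // b                                     # chases needed to advance C columns
    columns = (omega - 1) * (2 * b - c) + c + d      # left-hand side of the constraint
    return omega, ell, columns


omega, ell, columns = lemma_10_8_parameters(96, 8)
```

With $c = d = b/2$ the occupied width simplifies to $C - b/2$, so the bulges fit in one processor's
block, and $\ell b = C$ confirms that $\ell$ chases move each bulge exactly one block to the right.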

³ Note that some bulges may need to be chased up to $2\ell_i$ times, some less.

⁴ Note that processor $P - 2$ may chase some bulges (partially or completely) off the end of the band when
invoking pass_bulges and create_bulges, depending on the number of columns owned by processor $P - 1$ and
the current bandwidth.

As in the sequential case, we fix the parameters to simplify the asymptotic analysis; in
practice, the parameters (including the number of processors $P' \le P$ used and the number
of columns C a processor owns) should be tuned independently.


Algorithm 10.3 Parallel CASBR

Require: $3bP$ divides $n$, $b$ is a power of 2, processor ranks are between 0 and $P - 1$, each
processor owns $C = n/P$ columns of A.

1: for i = 1 to log b do
2:     $b_i = b/2^{i-1}$, $c_i = b_i/2$, $d_i = b_i/2$, $\omega_i = 2C/(3b_i)$, $\ell_i = (3/2)\omega_i$
3:     if myrank > 0 then
4:         send left: block of C columns
5:     end if
6:     for j = 1 to 3 · myrank do
7:         receive from left: block of C columns (includes bulges)
8:         if myrank = P - 1 then
9:             clear_bulges
10:        else
11:            receive right: block of C columns
12:            pass_bulges
13:            send right: block of C columns (includes bulges)
14:        end if
15:        if j < 3 · myrank then
16:            send left: block of C columns
17:        end if
18:    end for
19:    for j = 1 to 3 do
20:        if myrank = P - 1 then
21:            create_and_clear_bulges
22:        else
23:            receive right: block of C columns
24:            create_bulges
25:            send right: block of C columns (includes bulges)
26:        end if
27:    end for
28: end for


We analyze the arithmetic, bandwidth, and latency costs along the critical path of the
algorithm. That is, we follow the progress of the first $\omega_1$ bulges from processor 0 to processor
$P-2$, at which point (exactly) one of processors $P-2$ and $P-1$ is active chasing and/or
clearing bulges on every remaining step of every sweep.
10.4.1.3 Arithmetic Cost

From Lemma 10.5, the arithmetic cost of chasing one bulge (a single hop), with parameters
$b$, $c$, and $d$, is bounded above by $8bcd + 4cd^2 + O(bc)$ flops, while the cost of creating a bulge
and the cost of chasing a bulge partially or completely off the band are less. For our choices
$b/2^i = b_i/2 = c_i = d_i$, this cost is $(5/2)b_i^3$ flops.

Every kernel call involves at most $\omega_i$ bulges; the calls to create_bulges, pass_bulges,
clear_bulges, and create_and_clear_bulges each chase the bulges about $\ell_i$ times, so each
kernel invocation costs about $\omega_i \ell_i (5b_i^3/2) = O(n^2 b_i/P^2)$ flops. Following the critical path,
there are (fewer than) $P$ kernel invocations while the pipeline fills. At this point, processors
$P-2$ and $P-1$ are active for the remainder of the execution, each invoking a kernel on
alternating steps. There are $3P$ steps (iterations of the inner two for-loops) per sweep, each
with one kernel invocation (along the critical path). Altogether, this is


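$$O\left(\frac{n^2 b}{P}\right) + \sum_{i=1}^{\log b} O\left(\frac{n^2 b_i}{P}\right) = O\left(\frac{n^2 b}{P}\right)$$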


flops.  The  hidden  leading  constant  is  about  20;  a  cyclic  layout  and  logical  blocking  as  in 
[102]  can  be  applied  here  to  reduce  this  constant  to  between  5  and  10  (note  these  same 
strategies  reduced  the  corresponding  constant  in  Lang’s  algorithm’s  arithmetic  cost  from  24 
to  between  6  and  12). 
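
As a sanity check on the per-kernel estimate, the sketch below evaluates the leading-order hop cost $8bcd + 4cd^2$ from Lemma 10.5 (dropping the $O(bc)$ term) and multiplies it by $\omega_i \ell_i$. The function names are hypothetical, and exact rational arithmetic is used only so that the identity $\omega_i \ell_i (5b_i^3/2) = (5/3) C^2 b_i$ can be checked without rounding; this sketch does not attempt to reproduce the overall leading constant quoted above.

```python
from fractions import Fraction


def hop_flops(b, c, d):
    """Leading-order flops to chase one bulge one hop (O(bc) term omitted)."""
    return 8 * b * c * d + 4 * c * d * d


def kernel_flops(C, b_i):
    """Leading-order flops of one kernel call in a sweep with bandwidth b_i.

    Uses c_i = d_i = b_i/2, omega_i = 2C/(3 b_i) bulges, and ell_i = C/b_i hops.
    """
    c_i = d_i = Fraction(b_i, 2)
    omega_i = Fraction(2 * C, 3 * b_i)
    ell_i = Fraction(C, b_i)
    return omega_i * ell_i * hop_flops(b_i, c_i, d_i)
```

With $C = n/P$, the result $(5/3) C^2 b_i$ is $O(n^2 b_i / P^2)$, matching the estimate in the text.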


10.4.1.4  Bandwidth  Cost 

Every message in the algorithm consists of $C$ columns of the band; because of bulges and
triangular fill stored below the $b_i$th subdiagonal, each message (during the $i$th sweep) has
size (at most) $C(3b_i/2 + 1) = O(nb_i/P)$ words. Following the critical path as before, we have
the upper bound of

$$O(nb_1) + \sum_{i=1}^{\log b} O(nb_i) = O(nb)$$

words moved.
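
The sum over sweeps collapses because the bandwidths form a geometric series. Assuming $b_i = b/2^{i-1}$, as in the parameter choices above, a short worked step is:

$$\sum_{i=1}^{\log b} b_i = \sum_{i=1}^{\log b} \frac{b}{2^{i-1}} = 2b - 2 < 2b, \qquad\text{so}\qquad \sum_{i=1}^{\log b} O(nb_i) = O(nb).$$

For instance, with $b = 8$ the three sweeps have bandwidths $8 + 4 + 2 = 14 = 2 \cdot 8 - 2$.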


10.4.1.5  Latency  Cost 

The latency cost analysis is similar to the bandwidth cost analysis, replacing the $O(nb_i)$
terms by $O(1)$; in total, we have $O(P \log b)$ messages. This is asymptotically smaller than
the $O(n)$ messages that Lang’s algorithm sends: we save a factor of $O(n/(P \log b))$ messages.

10.4.2  Computing  Eigenvalues  and  Eigenvectors 

Recall our three steps: first, tridiagonalize $A = QTQ^T$; second, compute the
eigendecomposition $T = V \Lambda V^T$ with an efficient algorithm; finally, back-transform the matrix $V$ by
computing $QV$. We may either store $Q$ implicitly as a collection of Householder vectors, and
apply it using a blocked approach, or compute $Q$ explicitly by applying the orthogonal
updates (from the band reduction) to an identity matrix, and then compute $QV$ with a matrix
multiplication. As in the sequential case, the computation and communication involved in
constructing and/or applying $Q$ dominate the costs of the band reduction.

Algorithm    Flops                                  Words                               Messages
Lang [10]    $O(n^2 b/\sqrt{P} + n^3/P)$            $O(nb + n^2/\sqrt{P})$              $O(n + n/b)$
CASBR        $O(n^2 b/\sqrt{P} + (n^3/P)\log b)$    $O(nb + (n^2/\sqrt{P})\log b)$      $O(\sqrt{P}\log b + \sqrt{P}\log b)$

Table 10.4: Asymptotic comparison of previous parallel algorithms for tridiagonalization
(for eigenvalues and eigenvectors) with our improvements, for symmetric band matrices of
$n$ columns and $b + 1$ subdiagonals on a machine with $P$ processors. We assume only $O(\sqrt{P})$
of the processors participate in the band reduction for both algorithms. We include the cost
of the back transformation (but not the cost of the tridiagonal eigendecomposition). The
first row assumes $\sqrt{P} \le n/b$, and the second row assumes $\sqrt{P} \le n/(3b)$. The asymptotic
arithmetic and communication costs are determined along the critical path. The two terms
in each cost correspond to the band reduction and the back transformation, respectively.

We assume $V$ is distributed in a 2D blocked fashion to all $P$ processors, and that the
bandwidth $b$ of $A$ is (at most) 1/3 of the width of a block row of $V$, i.e., $b \le n/(3\sqrt{P})$.
This means that we will use only $\sqrt{P}$ of the $P$ available processors to perform the band
reduction, and all $P$ for the back-transformation. So, we must assume each processor has
$\Omega(n^2/P)$ words of memory.

We collect the results from the analyses in Sections 10.4.2.1-10.4.2.2 in Table 10.4.
Under our assumptions, for both algorithms, the arithmetic and bandwidth costs of the
back-transformation always dominate those of the band reduction. The asymptotic arithmetic
costs decrease linearly (in $P$) as expected. The first step of two-step tridiagonalization can
attain the communication lower bounds for parallel dense linear algebra (without extra
memory), i.e., $\Omega(n^2/\sqrt{P})$ words moved and $\Omega(\sqrt{P})$ messages, if $b = O(n/\sqrt{P})$. Asymptotically,
both algorithms attain the bandwidth lower bound, up to a factor of $\Theta(\log(n/\sqrt{P}))$ in the
case of CASBR. However, only CASBR attains the latency lower bound of $\Omega(\sqrt{P})$, again up
to a factor of $\Theta(\log(n/\sqrt{P}))$.

10.4.2.1  Alternate  Approaches 

The approach in [11] stores $Q$ implicitly (as a sequence of Householder transformations) and
then applies $Q$ to $V$ in a blocked fashion. The authors give three algorithms to compute $QV$,
with different parallel layouts of the matrix $V$; we consider only their best approach, based
on a 2D layout which is dynamically rebalanced. Before computing $QV$, we assume each
processor owns an $(n/\sqrt{P})$-by-$(n/\sqrt{P})$ block of $V$. Again, we refer the reader to the detailed
analysis in [10]. Along the critical path, the additional costs for the back-transformation are
$O(n^3/P)$ flops, $O(n^2/\sqrt{P})$ words moved, and $O(n/b)$ messages.

10.4.2.2  CASBR 

As  in  the  sequential  case  (Section  10.3.2.2),  we  construct  Q  explicitly  rather  than  storing 
it  implicitly.  The  extra  cost  of  the  matrix  multiplication  QV  is  dominated  by  the  cost  of 
constructing  Q  and  thus  will  not  affect  our  asymptotic  analysis.  Again,  in  practice,  this  cost 
can  be  avoided  by  storing  and  applying  Q  to  V  as  a  sequence  of  Householder  transformations. 

By the assumption $\sqrt{P} \le n/(3b)$, we can involve all $\sqrt{P}$ processors in each processor row
in a band reduction. Since the arithmetic cost for the band reduction is a lower order term,
we can afford to perform the band reduction $\sqrt{P}$ times redundantly (or once, but only on a
subset of $\sqrt{P}$ processors). We distribute the band $A$ to each row of the given $\sqrt{P}$-by-$\sqrt{P}$
processor grid; each row performs the band reduction once. Note that each processor owns
$C = n/\sqrt{P}$ columns of $A$, rather than $n/P$ (as before).

We use Algorithm 10.4, a modification of Algorithm 10.3, which simultaneously computes
the band reduction and the $n$-by-$n$ matrix $Q$. That is, we postmultiply an $n$-by-$n$ identity
matrix $I_n$ by each orthogonal matrix $Q_1, Q_2, \ldots$, generated by the bulge chasing procedure.
(To simplify the presentation, we will again refer to the intermediate products
$I_n \cdot Q_1 \cdot Q_2 \cdots$ also as $Q$, and the intermediate band matrices all as $A$.) These orthogonal updates
combine columns of $Q$ (but not rows); thus, each processor row may work independently.
Each processor row is assigned $C$ contiguous rows of $Q$; the columns of this block row
are distributed according to the distribution of the band matrix. That is, if processor $i$
(indexed within a given processor row) owns the first element of the $j$th row of $A$, then
processor $i$ will own the $j$th column of the corresponding block row of $Q$. In this way, the
communication pattern of the blocks of $Q$ between neighboring processors will exactly match
the communication pattern of the blocks of the band. Whenever a processor performs a local
kernel on $2C$ columns of the band, it will also apply all of those updates to $2C$ columns of (its
block row of) $Q$. This implies that in sweep $i$, within each processor row, the first processor
owns the first $C + b_i$ columns of the corresponding block row of $Q$, each subsequent processor
owns the next $C$ columns, and the last processor owns the last $C - b_i$ columns. (Note that the
first processor does not touch the first $b_i/2$ columns, but rather stores them to be updated in
the next sweep.) This distribution also implies that between sweeps $i$ and $i+1$, the $Q$ matrix
must be shifted to maintain the relationship between the ownership of rows of the band and
the columns of $Q$. To simplify the presentation, we assume that on each sweep $i$, $Q$ is padded
with $b_i$ zero columns, and that the first processor in each row always sends its rightmost $C$
columns; under these assumptions, each processor always sends/receives $C$-by-$C$ blocks of
$Q$, avoiding fringe cases for the first and last processors (within each processor row).
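
The unpadded column ownership described above can be written down directly. The following sketch, with the hypothetical helper `q_column_ranges`, returns the half-open column range of a block row of $Q$ held by each processor of a processor row during a sweep with bandwidth $b_i$; it assumes at least two processors per row and ignores the zero padding used to simplify the presentation.

```python
def q_column_ranges(n, procs_per_row, b_i):
    """Half-open column ranges [start, stop) of a block row of Q, per processor.

    The first processor owns C + b_i columns, interior processors own C
    columns each, and the last processor owns C - b_i columns, where
    C = n / procs_per_row.
    """
    if procs_per_row < 2 or n % procs_per_row:
        raise ValueError("need at least 2 processors per row dividing n")
    C = n // procs_per_row
    widths = [C] * procs_per_row
    widths[0] = C + b_i
    widths[-1] = C - b_i
    ranges, start = [], 0
    for w in widths:
        ranges.append((start, start + w))
        start += w
    return ranges
```

Because $b_i$ halves from one sweep to the next, the boundaries returned for sweep $i$ and sweep $i+1$ differ by $b_i/2$ columns, which is the shift of $Q$ between sweeps mentioned in the text.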

For  the  orthogonal  updates  of  Q ,  we  introduce  four  new  kernels — create_bulges_update, 
pass_bulges_update,  create_and_clear_bulges_update,  and  clear_bulges_update — which  apply 
the  right  orthogonal  updates  (as  sets  of  Householder  transformations)  from  the  corresponding 
band  reduction  kernels  to  the  local  blocks  of  Q. 
Again, we do not analyze computing the eigendecomposition of T, but we assume that
this  step  terminates  with  V  distributed  across  the  processor  grid  with  each  processor  owning 
a  C-by-C  block  of  V.  We  then  compute  QV  using  matrix-matrix  multiplication. 

In the following complexity analysis, we count only the additional work and
communication done for the orthogonal updates. To obtain the results for CASBR in Table 10.4,
we simply add the band reduction costs (Section 10.4.1.2), substituting $\sqrt{P}$ for $P$ (since
now we run the band reduction redundantly). Then we add the cost of multiplying $QV$
with Cannon’s algorithm [49], which costs $2n^3/P$ flops, $O(n^2/\sqrt{P})$ words moved, and $O(\sqrt{P})$
messages. These are all lower order terms, due to the logarithmic factors in the other costs.
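
For readers unfamiliar with Cannon’s algorithm, the following sequential simulation sketches its communication pattern on a $q$-by-$q$ grid with $q = \sqrt{P}$: after an initial skew, every processor multiplies its local blocks, then passes its block of the left operand one step left and its block of the right operand one step up, for $q$ rounds. This is a textbook-style illustration under the assumption that $q$ divides the matrix dimension; the function name and the list-of-lists representation are not taken from the thesis.

```python
def cannon_multiply(A, B, q):
    """Simulate Cannon's algorithm for square matrices on a q-by-q processor grid."""
    n = len(A)
    if n % q:
        raise ValueError("q must divide the matrix dimension")
    s = n // q  # local block dimension

    def block(M, r, c):
        return [row[c * s:(c + 1) * s] for row in M[r * s:(r + 1) * s]]

    # Initial skew: processor (r, c) starts with blocks A(r, r+c) and B(r+c, c).
    a = [[block(A, r, (r + c) % q) for c in range(q)] for r in range(q)]
    b = [[block(B, (r + c) % q, c) for c in range(q)] for r in range(q)]
    acc = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):  # q rounds, each with one local multiply and one shift
        for r in range(q):
            for c in range(q):
                for i in range(s):
                    for j in range(s):
                        acc[r][c][i][j] += sum(
                            a[r][c][i][t] * b[r][c][t][j] for t in range(s))
        # Every processor passes its A block left and its B block up.
        a = [[a[r][(c + 1) % q] for c in range(q)] for r in range(q)]
        b = [[b[(r + 1) % q][c] for c in range(q)] for r in range(q)]

    # Gather the distributed result into one n-by-n matrix.
    return [[acc[i // s][j // s][i % s][j % s] for j in range(n)]
            for i in range(n)]
```

Each round moves two blocks of $n^2/P$ words per processor, so $q = \sqrt{P}$ rounds give the $O(n^2/\sqrt{P})$ words and $O(\sqrt{P})$ messages quoted above.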


10.4.2.3  Arithmetic  Cost 


As argued in Section 10.4.1.2, there are at most $\omega_i \ell_i$ bulges chased in the pass_bulges,
clear_bulges, and create_and_clear_bulges kernels, and at most $2\omega_i \ell_i$ bulges chased in the
create_bulges kernel. Since the number of Householder entries in each bulge chase is $c_i d_i = b_i^2/4$,
from Lemma 10.3, the cost of applying the updates from one kernel invocation to $n/\sqrt{P}$ rows
of the $Q$ matrix is at most

$$4 \cdot \frac{b_i^2}{4} \cdot \frac{n}{\sqrt{P}} \cdot \omega_i \ell_i = \frac{2n^3}{3P^{3/2}} = O\left(\frac{n^3}{P^{3/2}}\right)$$

flops (and up to 2 times more for create_bulges).

Following the analysis in Section 10.4.1.2, we can upper bound the additional arithmetic
performed along the critical path by

$$O\left(\sqrt{P} \cdot \frac{n^3}{P^{3/2}}\right) + \sum_{i=1}^{\log b} O\left(\sqrt{P} \cdot \frac{n^3}{P^{3/2}}\right) = O\left(\frac{n^3}{P} \log b\right)$$

flops. The costs of the band reduction and multiplication $QV$ are lower order terms.


10.4.2.4  Bandwidth  Cost 


The communication costs of the orthogonal updates are also analogous to band reduction.
As shown in Algorithm 10.4, for every message sent/received containing a block of $A$, there is
a second message containing a block of $Q$. (The additional message every sweep to shift the
block row of $Q$ amounts to a lower order term.) However, while the size of the $A$ messages
decreases with the bandwidth, the size of the $Q$ messages remains the same ($n^2/P$ words).
The additional bandwidth cost, following the analysis in Section 10.4.1.2, is bounded by

$$O\left(\sqrt{P} \cdot \frac{n^2}{P}\right) + \sum_{i=1}^{\log b} O\left(\sqrt{P} \cdot \frac{n^2}{P}\right) = O\left(\frac{n^2}{\sqrt{P}} \log b\right)$$

words moved. Again, the costs of the band reduction and multiplication $QV$ are lower order
terms.
Algorithm 10.4 Parallel SBR with orthogonal updates

Require: $3b\sqrt{P}$ divides $n$, $b$ is a power of 2. Processor ranks are with respect to the processor row (i.e.,
between $0$ and $\sqrt{P}-1$). Within each processor row, each processor stores $C = n/\sqrt{P}$ columns of $A$, and
a $C$-by-$C$ (or $C$-by-$(C \pm b_i)$) block of $Q$, whose column indices correspond to the indices of the local rows
of $A$ whose first (leftmost) nonzero is stored locally.

for $i = 1$ to $\log b$ do
    $b_i = b/2^{i-1}$, $c_i = b_i/2$, $d_i = b_i/2$, $\omega_i = 2C/(3b_i)$
    if myrank > 0 then
        send left: block of C columns of A
        send left: block of C columns and rows of Q
    end if
    for $j = 1$ to $3 \cdot$ myrank do
        receive from left: block of C columns of A (includes bulges)
        receive from left: block of C columns and rows of Q
        if myrank = $\sqrt{P}-1$ then
            clear_bulges
            clear_bulges_update
        else
            receive from right: block of C columns of A
            receive from right: block of C columns and rows of Q
            pass_bulges
            pass_bulges_update
            send right: block of C columns of A (includes bulges)
            send right: block of C columns and rows of Q
        end if
        if $j < 3 \cdot$ myrank then
            send left: block of C columns of A
            send left: block of C columns and rows of Q
        end if
    end for
    for $j = 1$ to $3$ do
        if myrank = $\sqrt{P}-1$ then
            create_and_clear_bulges
            create_and_clear_bulges_update
        else
            receive from right: block of C columns of A
            receive from right: block of C columns and C rows of Q
            create_bulges
            create_bulges_update
            send right: block of C columns of A (includes bulges)
            send right: block of C columns and rows of Q
        end if
    end for
    if myrank < $\sqrt{P}-1$ then
        send right: block of $b_i/2$ columns and C rows of Q
    else if myrank > 0 then
        receive left: block of $b_i/2$ columns and C rows of Q
    end if
end for
10.4.2.5 Latency Cost

The additional latency cost is the same as that for the band reduction (see Section 10.4.1.2)
plus the shift (a lower order term), i.e., $O(\sqrt{P} \log b)$ messages. In the more restrictive case
$\sqrt{P} \le n/(b \log b)$, this is an asymptotic improvement compared to Lang’s algorithm for just
the back-transformation phase; considering also the cost of the band reduction, we always
have an asymptotic improvement.


10.5  Conclusions 

In theory, both band reduction and dense matrix-matrix multiplication have $O(n)$ possible
data reuse in the sequential case, given by the ratio of total flops to size of inputs and outputs.
When the problem does not fit in fast memory (of size $M$ words), matrix multiplication can
attain only $O(\sqrt{M})$ data reuse [88], while our CASBR algorithm achieves $O(M/b)$ reuse,
provided $b \le \sqrt{M}/3$. This constraint on $b$ also ensures that the reuse is always asymptotically
at least as large as that of matrix multiplication, and when $b \ll \sqrt{M}$, we can actually attain
much better reuse.
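
The comparison with matrix multiplication follows in one line from the constraint on $b$; as a worked step (with constants shown only for illustration):

$$b \le \frac{\sqrt{M}}{3} \quad\Longrightarrow\quad \frac{M}{b} \ge \frac{M}{\sqrt{M}/3} = 3\sqrt{M},$$

so the $O(M/b)$ reuse of CASBR is never asymptotically smaller than the $O(\sqrt{M})$ reuse of matrix multiplication, and it grows as $b$ shrinks relative to $\sqrt{M}$.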

Indeed,  improved  data  reuse  often  translates  to  better  performance.  In  [31],  we  observed 
that  using  the  techniques  of  reducing  communication  (even  at  the  expense  of  some  extra 
arithmetic),  as  well  as  a  framework  that  automatically  tuned  the  algorithmic  parameters, 
led to speedups of 2-6× on sequential and shared-memory parallel machines. We believe that
these  benefits  will  extend  to  the  distributed-memory  case,  particularly  when  performance  is 
latency-bound. 

The  performance  results  in  [31]  focused  on  the  case  of  computing  eigenvalues  only  and 
did  not  include  the  cost  of  the  back-transformation  phase.  In  that  case,  the  arithmetic  cost 
increased  by  no  more  than  50%.  As  we  have  seen,  the  cost  of  the  back-transformation, 
which  dominates  that  of  the  band  reduction  when  eigenvectors  are  requested,  increases  with 
the  number  of  sweeps.  For  example,  for  the  successive  halving  approach,  the  increase  in 
arithmetic was a factor of $O(\log b)$. Thus, there exists an important tradeoff between reducing
communication in the band reduction phase and the resulting increased costs in the
back-transformation phase. Note that when computing partial eigensystems, the costs of the
back-transformation  can  be  reduced  to  be  proportional  to  the  number  of  eigenvectors  desired, 
improving  this  tradeoff. 

We also do not give algorithms or complexity analysis for taking more than 1 and fewer
than $\log b$ sweeps and using the technique of chasing multiple bulges. Indeed, we fixed many
parameters  here  with  the  sole  intention  of  simplifying  the  theoretical  analysis.  In  practice, 
parameters  such  as  the  number  of  sweeps  and  the  number  of  bulges  chased  at  a  time  should 
be  autotuned  for  the  target  architecture  to  navigate  the  tradeoffs  mentioned  above. 

Recall our application of two-step tridiagonalization for the symmetric eigenproblem.
The first step (full-to-banded) and its corresponding back-transformation phase can be
performed efficiently [17, 109]. Combined with the approaches here for the second step (and
an efficient tridiagonal eigensolver), we have sequential and parallel algorithms for the
symmetric eigenproblem that attain the communication lower bounds for dense linear algebra in
[28] up to $O(\log b)$ factors: in the sequential case, $\Omega(n^3/\sqrt{M})$ words moved and $\Omega(n^3/M^{3/2})$
messages; in the parallel case (if minimal memory is used), $\Omega(n^2/\sqrt{P})$ words moved and
$\Omega(\sqrt{P})$ messages. Even though these lower bounds formally apply to only the first step,
they are still valid lower bounds for any algorithm that performs this step.

We also remark that, in the sequential case, similar techniques to those used in a
communication-optimal first step can also be applied in the case $b > \sqrt{M}/3$ (violating
an assumption in Section 10.3). In fact, for any $b + 1 < n$, we can reduce to tridiagonal form
with communication costs that attain (or beat) the aforementioned bounds.
Chapter 11

Communication-Avoiding Parallel Strassen

In this chapter, we consider the parallelization of Strassen’s fast matrix multiplication
algorithm. Our main contribution is a new algorithm we call Communication-Avoiding Parallel
Strassen, or CAPS, that

•  perfectly load balances the $O(n^{\lg 7})$ flops across processors,

•  is communication optimal, attaining the bandwidth and latency cost lower bounds of
Chapter 5 and Section 6.2.1.2 up to a logarithmic factor in the number of processors,

•  requires asymptotically less communication than previous parallelizations of Strassen’s
algorithm,

•  requires asymptotically less computation and communication than all classical
algorithms, and

•  outperforms in practice all other known implementations of matrix multiplication,
Strassen-based or classical.

The  algorithm  and  its  computational  and  communication  cost  analyses  are  presented  in 
Section  11.2.  There  we  show  it  matches  the  communication  lower  bounds.  We  provide  a 
review  and  analysis  of  previous  algorithms  in  Section  11.3.  We  also  consider  two  natural 
combinations  of  previously  known  algorithms  (Sections  11.3.4  and  11.3.5).  One  of  these  new 
algorithms  that  we  call  “2.5D-Strassen”  performs  better  than  all  previous  algorithms,  but  is 
still  not  optimal,  and  is  inferior  to  CAPS. 

We discuss our implementations of the new algorithms and compare their performance
with previous ones in Section 11.4 to show that our new CAPS algorithm outperforms
previous algorithms not just asymptotically, but also in practice. Benchmarking our
implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms
ranging from 24% to 184% for a fixed matrix dimension $n = 94080$, where the number of
nodes ranges from 49 to 7203.

In  Section  11.5  we  show  that  our  parallelization  method  applies  to  other  fast  matrix 
multiplication  algorithms.  It  also  applies  to  classical  recursive  matrix  multiplication,  thus 
obtaining  a  new  optimal  classical  algorithm  that  matches  the  communication  complexity 
of  the  2.5D  algorithm  of  Solomonik  and  Demmel  [137].  In  Section  11.5,  we  also  discuss 
numerical  stability,  hardware  scaling,  and  future  work. 

The  main  results  of  this  chapter  appear  in  [20],  written  with  coauthors  James  Demmel, 
Olga  Holtz,  Benjamin  Lipshitz,  and  Oded  Schwartz.  A  subsequent  paper  [105],  written  with 
James  Demmel,  Benjamin  Lipshitz,  and  Oded  Schwartz,  details  the  implementation  of  the 
algorithm  on  multiple  machines  and  presents  more  extensive  performance  data.  Much  of 
that  content  also  appears  in  [106]. 


11.1  Preliminaries 

11.1.1  Strassen’s  Algorithm 

Strassen showed that $2 \times 2$ matrix multiplication can be performed using 7 multiplications and
18 additions, instead of the classical algorithm that does 8 multiplications and 4 additions
[139]. By recursive application this yields an algorithm which multiplies two $n \times n$ matrices
using $O(n^{\omega_0})$ flops, where $\omega_0 = \log_2 7 \approx 2.81$ (see Section 2.4.1). Winograd improved the algorithm
to use 7 multiplications and 15 additions in the base case, thus decreasing the hidden constant
in the $O$ notation [152]. Our implementation uses the Winograd variant (see Section 2.4.2
for details).
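
For reference, a minimal sequential sketch of the recursion is given below. It uses Strassen’s original 7-multiplication, 18-addition formulas rather than the Winograd variant used in our implementation, assumes the dimension is a power of 2, and recurses all the way to $1 \times 1$ blocks (a practical code would switch to a classical kernel at some cutoff). The helper names are illustrative.

```python
def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def quadrants(M):
    """Split a square matrix of even dimension into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply square matrices whose dimension is a power of 2."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = quadrants(A)
    B11, B12, B21, B22 = quadrants(B)
    # Seven recursive products (10 block additions before the multiplies).
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    # Eight block additions combine the products into the output quadrants.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

Each level replaces one multiplication of dimension $n$ by seven of dimension $n/2$, which is where the exponent $\log_2 7$ comes from.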

11.1.2  Previous  Work  on  Parallel  Strassen 

In  this  section  we  briefly  describe  previous  efforts  to  parallelize  Strassen.  More  details, 
including  communication  analyses,  are  in  Section  11.3.  A  summary  appears  in  Table  11.1. 

Luo and Drake [108] explored Strassen-based parallel algorithms that use the
communication patterns known for classical matrix multiplication. They considered using a classical
2D parallel algorithm and using Strassen locally, which corresponds to what we call the
“2D-Strassen” approach (see Section 11.3.2). They also considered using Strassen at the highest
levels and performing a classical parallel algorithm for each subproblem generated, which
corresponds to what we call the “Strassen-2D” approach. The size of the subproblems depends
on the number of Strassen steps taken (see Section 11.3.3). Luo and Drake also analyzed
the communication costs for these two approaches.

Soon  after,  Grayson,  Shah,  and  van  de  Geijn  [78]  improved  on  the  Strassen-2D  approach 
of  [108]  by  using  a  better  classical  parallel  matrix  multiplication  algorithm  and  running  on  a 
more  communication-efficient  machine.  They  obtained  better  performance  results  compared 
to  a  purely  classical  algorithm  for  up  to  three  levels  of  Strassen’s  recursion. 
Kumar, Huang, Johnson, and Sadayappan [101] implemented Strassen’s algorithm on
a shared-memory machine. They identified the tradeoff between available parallelism and
total memory footprint by differentiating between “partial” and “complete” evaluation of
the algorithm, which corresponds to what we call depth-first and breadth-first traversal of
the recursion tree (see Section 11.2.1). They show that by using $\ell$ DFS steps before using
BFS steps, the memory footprint is reduced by a factor of $(7/4)^\ell$ compared to using all BFS
steps. They did not consider communication costs in their work.

Other  parallel  approaches  [64,  90,  138]  have  used  more  complex  parallel  schemes  and 
communication  patterns.  However,  they  restrict  attention  to  only  one  or  two  steps  of  Strassen 
and  obtain  modest  performance  improvements  over  classical  algorithms. 

11.1.3  Lower  Bounds  for  Strassen’s  Algorithm 

The  main  impetus  for  this  work  was  the  observation  of  the  asymptotic  gap  between  the 
communication  costs  of  existing  parallel  Strassen-based  algorithms  and  the  communication 
lower  bounds  given  by  Theorems  5.10  and  6.8.  Because  of  the  attainability  of  the  lower 
bounds  in  the  sequential  case,  we  hypothesized  that  the  gap  could  be  closed  by  finding  a 
new  algorithm  rather  than  by  tightening  the  lower  bounds. 

We made three observations from the lower bound results of Chapter 5 that led to the
new algorithm. First, the lower bounds for Strassen are lower than those for classical matrix
multiplication. This implies that the communication pattern of an optimal Strassen-based
algorithm cannot be that of a classical algorithm but must reflect the properties of Strassen’s
algorithm. Second, the factor $M^{\omega_0/2-1}$ that appears
in the denominator of the communication cost lower bound implies that an optimal algorithm
must use as much local memory as possible. That is, there is a tradeoff between memory
usage and communication (the same is true in the classical case). Third, the proof of the
lower bounds shows that in order to minimize communication costs relative to computation,
it is necessary to perform each submatrix multiplication of size $\Theta(\sqrt{M}) \times \Theta(\sqrt{M})$ on a single
processor.

With these observations and assisted by techniques from previous approaches to
parallelizing Strassen, we developed a new parallel algorithm which achieves perfect load balance,
minimizes  communication  costs,  and  in  particular  performs  asymptotically  less  computation 
and  communication  than  is  possible  using  classical  matrix  multiplication. 


11.2  The  Algorithm 

In  this  section  we  present  the  CAPS  algorithm,  and  prove  it  is  communication  optimal.  See 
Algorithm  11.1  for  a  concise  presentation  and  Algorithm  11.2  for  a  more  detailed  description. 
We use notation for the Strassen-Winograd algorithm from Section 2.4.2.
Figure 11.1: Representation of BFS and DFS steps. In a BFS step, all seven subproblems
are  computed  at  once,  each  on  1/7  of  the  processors.  In  a  DFS  step,  the  seven  subproblems 
are  computed  in  sequence,  each  using  all  the  processors.  The  notation  follows  that  of  Section 
2.4.2. 

11.2.1  Overview  of  CAPS 

Consider the recursion tree of Strassen’s sequential algorithm. CAPS traverses it in parallel
as follows. At each level of the tree, the algorithm proceeds in one of two ways. A
“breadth-first-step” (BFS) divides the 7 subproblems among the processors, so that 1/7 of the processors
work on each subproblem independently and in parallel. A “depth-first-step” (DFS) uses all
the processors on each subproblem, solving each one in sequence. See Figure 11.1.

In  short,  a  BFS  step  requires  more  memory  but  reduces  communication  costs  while  a  DFS 
step  requires  little  extra  memory  but  is  less  communication-efficient.  In  order  to  minimize 
communication  costs,  the  algorithm  must  choose  an  ordering  of  BFS  and  DFS  steps  that 
uses  as  much  memory  as  possible. 

Let $k = \log_7 P$ and $s \ge k$ be the number of distributed Strassen steps the algorithm will
take. In this section, we assume that $n$ is a multiple of $2^s 7^{\lceil k/2 \rceil}$. If $k$ is even, the restriction
simplifies to $n$ being a multiple of $2^s \sqrt{P}$. Since $P$ is a power of 7, it is sometimes convenient
to think of the processors as numbered in base 7. CAPS performs $s$ steps of Strassen’s
algorithm and finishes the calculation with local matrix multiplication. The algorithm can
easily be generalized to other values of $n$ by padding or dynamic peeling.
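
The divisibility requirement, and the simplest way to satisfy it by zero padding, can be sketched as follows. The helper names are hypothetical, and padding up to the next multiple is only one of the two generalizations mentioned above (dynamic peeling is not shown).

```python
def log7(P):
    """Return k such that P == 7**k, or raise if P is not a power of 7."""
    k = 0
    while P > 1 and P % 7 == 0:
        P //= 7
        k += 1
    if P != 1:
        raise ValueError("P must be a power of 7")
    return k


def required_multiple(P, s):
    """The value 2^s * 7^ceil(k/2) that must divide the matrix dimension."""
    k = log7(P)
    if s < k:
        raise ValueError("need s >= k = log_7 P Strassen steps")
    return 2 ** s * 7 ** ((k + 1) // 2)


def padded_dimension(n, P, s):
    """Smallest dimension >= n that satisfies the divisibility requirement."""
    m = required_multiple(P, s)
    return -(-n // m) * m  # round n up to a multiple of m
```

For $P = 49$ and $s = 2$ the requirement is divisibility by $2^2 \cdot 7 = 28$, which the dimension $n = 94080$ used in our benchmarks already satisfies.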

We consider two simple schemes of traversing the recursion tree with BFS and DFS steps.
The first scheme, which we call the Unlimited Memory (UM) scheme, is to take $k$ BFS steps
in a row. This approach is possible only if there is sufficient available memory. The second
scheme, which we call the Limited Memory (LM) scheme, is to take $\ell$ DFS steps in a row
followed by $k$ BFS steps in a row, where $\ell$ is minimized subject to the memory constraints.
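
To illustrate how $\ell$ might be chosen in the LM scheme, the sketch below uses a deliberately simplified memory model that is an assumption of this example, not the analysis of this chapter: three $m \times m$ matrices occupy $3m^2/P$ words per processor, each BFS step multiplies the per-processor footprint by $7/4$, and DFS steps are treated as free. Under that model, $k$ BFS steps after $\ell$ DFS steps fit when $3 (n/2^\ell)^2 (7/4)^k / P \le M$.

```python
def choose_dfs_steps(n, P, k, M):
    """Smallest number of DFS steps after which k BFS steps fit in M words.

    Simplified model (an assumption of this sketch): footprint per processor
    is 3 * (n / 2**l)**2 / P * (7/4)**k. Integer arithmetic avoids rounding.
    """
    if M <= 0:
        raise ValueError("local memory size M must be positive")
    l = 0
    # Fits iff 3 * n^2 * 7^k <= M * P * 4^(k + l).
    while 3 * n * n * 7 ** k > M * P * 4 ** (k + l):
        l += 1
    return l
```

Returning 0 corresponds to the UM scheme; each additional DFS step shrinks the subproblem footprint by a factor of 4 in this model, so the loop terminates after logarithmically many iterations.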

It is possible to use a more complicated scheme that interleaves BFS and DFS steps to
reduce communication. We show that the LM scheme is optimal up to a constant factor,
and hence no more than a constant factor improvement can be attained from interleaving.
Algorithm 11.1 CAPS, in brief. For more details, see Algorithm 11.2.

Require: $A$, $B$, $n$, where $A$ and $B$ are $n \times n$ matrices; $P$ = number of processors
Ensure: $C = A \cdot B$

▷ The dependence of the $S_i$’s on $A$, the $T_i$’s on $B$, and $C$ on the $Q_i$’s follows the
  Strassen-Winograd algorithm. See Section 2.4.2.
procedure $C$ = CAPS($A$, $B$, $n$, $P$)
    if enough memory then                       ▷ Do a BFS step
        locally compute the $S_i$’s and $T_i$’s from $A$ and $B$
        parallel for $i = 1 \ldots 7$ do
            redistribute $S_i$ and $T_i$
            $Q_i$ = CAPS($S_i$, $T_i$, $n/2$, $P/7$)
            redistribute $Q_i$
        end parallel for
        locally compute $C$ from all the $Q_i$’s
    else                                        ▷ Do a DFS step
        for $i = 1 \ldots 7$ do
            locally compute $S_i$ and $T_i$ from $A$ and $B$
            $Q_i$ = CAPS($S_i$, $T_i$, $n/2$, $P$)
            locally compute contribution of $Q_i$ to $C$
        end for
    end if
end procedure


11.2.2  Data  Layout 

We  require  that  the  data  layout  of  the  matrices  satisfies  the  following  two  properties: 

1. At each of the $s$ Strassen recursion steps, the data layouts of the four submatrices of
   each of A, B, and C must match so that the weighted additions of these submatrices
   can be performed locally. This technique follows [108] and allows communication-free
   DFS steps.

2. Each of these submatrices must be equally distributed among the P processors for load
   balancing.

There are many data layouts that satisfy these properties, perhaps the simplest being
block-cyclic layout with a processor grid of size $7^{\lfloor k/2 \rfloor} \times 7^{\lceil k/2 \rceil}$ and block size
$\frac{n}{2^s 7^{\lfloor k/2 \rfloor}} \times \frac{n}{2^s 7^{\lceil k/2 \rceil}}$. (When $k = \log_7 P$ is even, these expressions simplify to a
processor grid of size $\sqrt{P} \times \sqrt{P}$ and block size $\frac{n}{2^s \sqrt{P}}$.) See Section 2.3.2 for a
description of block-cyclic layout and Figure 2.3 for an example with P = 49.
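The two layout properties can be checked by brute force on a small instance. The following is a minimal sketch, assuming the standard block-cyclic ownership rule (block row and block column taken modulo the grid dimensions); the function names `grid_shape`, `owner`, and `check_layout` are illustrative and do not come from the CAPS implementation. It only examines the four top-level quadrants, not all $s$ recursion levels.

```python
def grid_shape(k):
    """Processor grid 7^floor(k/2) x 7^ceil(k/2) for P = 7^k processors."""
    return 7 ** (k // 2), 7 ** ((k + 1) // 2)


def owner(i, j, n, k, s):
    """Grid coordinates of the processor that owns entry (i, j) of an n x n matrix."""
    pr, pc = grid_shape(k)
    br = n // (2 ** s * pr)  # block height
    bc = n // (2 ** s * pc)  # block width
    return (i // br) % pr, (j // bc) % pc


def check_layout(n, k, s):
    """Return True if the top-level quadrants satisfy both layout properties."""
    pr, pc = grid_shape(k)
    assert n % (2 ** s * pc) == 0, "n must be a multiple of 2^s * 7^ceil(k/2)"
    h = n // 2
    counts = {}
    for i in range(h):
        for j in range(h):
            p = owner(i, j, n, k, s)
            # Property 1: matching entries of all four quadrants live on one processor.
            others = (owner(i + h, j, n, k, s),
                      owner(i, j + h, n, k, s),
                      owner(i + h, j + h, n, k, s))
            if any(q != p for q in others):
                return False
            counts[p] = counts.get(p, 0) + 1
    # Property 2: every one of the P processors owns the same number of entries.
    return len(counts) == pr * pc and len(set(counts.values())) == 1
```

With `s = 0` the blocks are too coarse for the quadrants to line up, which is why the block size carries the factor $2^s$.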

Any layout that we use is specified by three parameters, (n, P, s), and intermediate stages
of the computation use the same layout with smaller values of the parameters. A BFS step
reduces a multiplication problem with layout parameters (n, P, s) to seven subproblems with
layout parameters (n/2, P/7, s - 1). A DFS step reduces a multiplication problem with
layout parameters (n, P, s) to seven subproblems with layout parameters (n/2, P, s - 1).

Note that if the input data is initially load-balanced but distributed using a different
layout, we can rearrange it to the above layout using a total of $O(n^2/P)$ words moved and
$O(n^2)$ messages. This has no asymptotic effect on the bandwidth cost but significantly
increases the latency cost in the worst case.

11.2.3  Unlimited  Memory  Scheme 

In  the  UM  scheme,  we  take  k  =  log7  P  BFS  steps  in  a  row.  Since  a  BFS  step  reduces  the 
number  of  processors  involved  in  each  subproblem  by  a  factor  of  7,  after  k  BFS  steps  each 
subproblem  is  assigned  to  a  single  processor,  and  so  is  computed  locally  with  no  further 
communication  costs.  We  first  describe  a  BFS  step  in  more  detail. 

The matrices A and B are initially distributed as described in Section 11.2.2. In order to
take a recursive step, the 14 matrices $S_1, \dots, S_7, T_1, \dots, T_7$ must be computed (the notation
follows that of Section 2.4.2). Each processor allocates space for all 14 matrices and performs
local additions and subtractions to compute its portion of the matrices. Recall that
the submatrices are distributed identically, so this step requires no communication. If the
layouts of A and B have parameters (n, P, s), the $S_i$ and the $T_i$ now have layout parameters
(n/2, P, s - 1).

The next step is to redistribute these 14 matrices so that the 7 pairs of matrices $(S_i, T_i)$
exist on disjoint sets of P/7 processors. This requires disjoint sets of 7 processors performing
an all-to-all communication step (each processor must send and receive a message from each
of the other 6). To see this, consider the numbering of the processors base-7. On the
$m$th BFS step, the communication is between the seven processors whose numbers agree
on all digits except the $m$th (counting from the right). After the $m$th BFS step, the set of
processors working on a given subproblem share the same $m$-digit suffix. After the above
communication is performed, the layout of $S_i$ and $T_i$ has parameters (n/2, P/7, s - 1), and
the sets of processors that own the $T_i$ and $S_i$ are disjoint for different values of $i$. Note that
since each all-to-all only involves seven processors no matter how large P is, this algorithm
does not have the scalability issues that typically come from an all-to-all communication
pattern.
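The base-7 description of the communication groups translates directly into integer arithmetic. The sketch below is illustrative only: `bfs_group` and `subproblem_suffix` are hypothetical helper names, ranks are assumed to be plain integers in the range 0 to P - 1, and digits are counted from the right starting at m = 1.

```python
def bfs_group(rank, m):
    """Ranks of the 7 processors that exchange data on the m-th BFS step.

    They agree with `rank` on every base-7 digit except the m-th from the right.
    """
    weight = 7 ** (m - 1)
    digit = (rank // weight) % 7
    base = rank - digit * weight  # same number with the m-th digit set to 0
    return [base + d * weight for d in range(7)]


def subproblem_suffix(rank, m):
    """The m-digit base-7 suffix shared by processors on one subproblem after m BFS steps."""
    return rank % 7 ** m
```

For P = 49, processor 10 (written 13 in base 7) talks to ranks 7 through 13 on the first step and to ranks 3, 10, 17, ..., 45 on the second; all members of the second group share the suffix 3, so the second exchange stays inside one subproblem.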

11.2.3.1  Memory  Requirements 

Algorithm 11.2 CAPS, in detail
Require: A, B are n × n matrices
         P = number of processors
         rank = processor number base-7 as an array
         M = local memory size
Ensure:  C = A · B

procedure C = CAPS(A, B, P, rank, M)
    $\ell = \lceil \log_2 \frac{4n}{P^{1/\omega_0} M^{1/2}} \rceil$              ▷ $\ell$ is number of DFS steps to fit in memory
    $k = \log_7 P$
    call DFS(A, B, C, k, $\ell$, rank)
end procedure

procedure DFS(A, B, C, k, $\ell$, rank)                    ▷ Do C = A · B by $\ell$ DFS, then k BFS steps
    if $\ell \le 0$ then call BFS(A, B, C, k, rank); return
    end if
    for i = 1 ... 7 do
        locally compute $S_i$ and $T_i$ from A and B       ▷ following Strassen-Winograd
        call DFS($S_i$, $T_i$, $Q_i$, k, $\ell - 1$, rank)
        locally compute contribution of $Q_i$ to C         ▷ following Strassen-Winograd
    end for
end procedure

procedure BFS(A, B, C, k, rank)                            ▷ Do C = A · B by k BFS steps, then local Strassen
    if k == 0 then call localStrassen(A, B, C); return
    end if
    for i = 1 ... 7 do
        locally compute $S_i$ and $T_i$ from A and B       ▷ following Strassen-Winograd
    end for
    for i = 1 ... 7 do
        target = rank
        target[k] = i
        send $S_i$ to target
        receive into L                                     ▷ One part of L comes from each of 7 processors
        send $T_i$ to target
        receive into R                                     ▷ One part of R comes from each of 7 processors
    end for
    call BFS(L, R, P, k - 1, rank)
    for i = 1 ... 7 do
        target = rank
        target[k] = i
        send ith part of P to target
        receive from target into $Q_i$
    end for
    locally compute C from $Q_i$                           ▷ following Strassen-Winograd
end procedure

The extra memory required to take one BFS step is the space to store all 7 triples $S_j$, $T_j$,
$Q_j$. Since each of those matrices is 1/4 the size of A, B, and C, the extra space required at
a given step is 7/4 the extra space required for the previous step. We assume that no extra
memory is required for the local multiplications (see footnote 1). Thus, the total local memory
requirement for taking k BFS steps is given by

$$\mathrm{Mem}_{UM}(n, P) = \frac{3n^2}{P} \sum_{i=0}^{k} \left(\frac{7}{4}\right)^i = \frac{7n^2}{P^{2/\omega_0}} - \frac{4n^2}{P} = \Theta\left(\frac{n^2}{P^{2/\omega_0}}\right).$$
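The closed form for the UM memory requirement can be checked against the geometric sum it comes from. This is a small sketch under two assumptions drawn from the text above: the matrices A, B, and C occupy $3n^2/P$ words per processor, and each BFS step needs 7/4 the space of the previous one. The function names are illustrative. Exact rational arithmetic is used, together with the identity $P^{2/\omega_0} = 4^k$ for $P = 7^k$.

```python
from fractions import Fraction


def mem_um_sum(n, k):
    """Mem_UM as (3 n^2 / P) * sum_{i=0}^{k} (7/4)^i with P = 7^k."""
    P = 7 ** k
    return Fraction(3 * n * n, P) * sum(Fraction(7, 4) ** i for i in range(k + 1))


def mem_um_closed(n, k):
    """Closed form 7 n^2 / P^(2/w0) - 4 n^2 / P, using P^(2/w0) = 4^k."""
    return Fraction(7 * n * n, 4 ** k) - Fraction(4 * n * n, 7 ** k)
```

For n = 28 and P = 49 both expressions give 279 words per processor, compared with 48 words for A, B, and C alone.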


11.2.3.2  Computational  Costs 

The computation required at a given BFS step is that of the local additions and subtractions
associated with computing the $S_i$ and $T_i$ and updating the output matrix C with the
$Q_i$. Since the Strassen-Winograd algorithm performs 15 additions and subtractions, the
computational cost recurrence is

$$F_{UM}(n, P) = 15 \left(\frac{n^2}{4P}\right) + F_{UM}\left(\frac{n}{2}, \frac{P}{7}\right)$$

with base case $F_{UM}(n, 1) = c_s n^{\omega_0} - 6n^2$, where $c_s$ is the constant of Strassen’s algorithm.
See Section 2.4.2 for more details. The solution to this recurrence is

$$F_{UM}(n, P) = \frac{c_s n^{\omega_0} - 6n^2}{P} = \Theta\left(\frac{n^{\omega_0}}{P}\right).$$

11.2.3.3  Communication  Costs 

Consider the communication costs associated with the UM scheme. Given that the redistribution
within a BFS step is performed by an all-to-all communication step among sets
of 7 processors, each processor sends 6 messages and receives 6 messages to redistribute
$S_1, \dots, S_7$, and the same for $T_1, \dots, T_7$. Each processor can pack the $S_i$ and $T_i$ data for a
single other processor into one message. After the products $Q_i = S_i T_i$ are computed, each
processor sends 6 messages and receives 6 messages to redistribute $Q_1, \dots, Q_7$. The size of
each message varies according to the recursion depth, and is the number of words a processor
owns of any $S_i$, $T_i$, or $Q_i$, namely $\frac{n^2}{4P}$ words.

As each of the $Q_i$ is computed simultaneously on disjoint sets of P/7 processors, we
obtain a cost recurrence for the entire UM scheme:

$$W_{UM}(n, P) = 36 \frac{n^2}{4P} + W_{UM}\left(\frac{n}{2}, \frac{P}{7}\right)$$
$$S_{UM}(n, P) = 24 + S_{UM}\left(\frac{n}{2}, \frac{P}{7}\right)$$

with base case $S_{UM}(n, 1) = W_{UM}(n, 1) = 0$. Thus

$$W_{UM}(n, P) = \frac{12n^2}{P^{2/\omega_0}} - \frac{12n^2}{P} = \Theta\left(\frac{n^2}{P^{2/\omega_0}}\right)$$
$$S_{UM}(n, P) = 24 \log_7 P = \Theta(\log P). \qquad (11.1)$$

Footnote 1: If one does not overwrite the input, it is impossible to run Strassen in place; however, using a few
temporary matrices affects the analysis here by a constant factor only.
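The solution (11.1) can be confirmed by unrolling the two recurrences level by level. The sketch below assumes the per-step costs described above, namely $36 \cdot n^2/(4P)$ words and 24 messages per processor, and uses $P = 7^k$ so that $P^{2/\omega_0} = 4^k$; the function names are illustrative.

```python
from fractions import Fraction


def comm_um(n, k):
    """Unroll the UM recurrences for P = 7^k; returns (words, messages) per processor."""
    words, msgs = Fraction(0), 0
    n, P = Fraction(n), 7 ** k
    for _ in range(k):
        words += 36 * n * n / (4 * P)  # 12 block transfers each for the S_i, T_i, and Q_i
        msgs += 24                     # S_i and T_i share messages; the Q_i need their own
        n, P = n / 2, P // 7
    return words, msgs


def comm_um_closed(n, k):
    """Closed forms from (11.1), with P^(2/w0) written as 4^k."""
    words = Fraction(12 * n * n, 4 ** k) - Fraction(12 * n * n, 7 ** k)
    return words, 24 * k
```

For n = 28 and P = 49 the two BFS steps move 144 and 252 words, for a total of 396 words and 48 messages, matching the closed forms.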


11.2.4  Limited  Memory  Scheme 

In this section we discuss a scheme for traversing Strassen’s recursion tree in the context
of limited memory. In the LM scheme, we take $\ell$ DFS steps in a row followed by k BFS
steps in a row, where $\ell$ is minimized subject to the memory constraints. That is, we use a
sequence of DFS steps to reduce the problem size so that we can use the UM scheme on each
subproblem without exceeding the available memory.

Consider taking a single DFS step. Rather than allocating space for and computing all 14
matrices $S_1, T_1, \dots, S_7, T_7$ at once, the DFS step requires allocation of only one subproblem,
and each of the $Q_i$ will be computed in sequence.

Consider the $i$th subproblem: as before, both $S_i$ and $T_i$ can be computed locally. After
$Q_i$ is computed, it is used to update the corresponding quadrants of C and then discarded
so that its space in memory (as well as the space for $S_i$ and $T_i$) can be re-used for the next
subproblem. In a DFS step, no redistribution occurs. After $S_i$ and $T_i$ are computed, all
processors participate in the computation of $Q_i$.

We assume that some extra memory is available. To be precise, assume the matrices A,
B, and C require only 1/3 of the available memory:

$$\frac{3n^2}{P} \le \frac{1}{3} M. \qquad (11.2)$$

In the LM scheme, we set

$$\ell = \max\left\{0, \left\lceil \log_2 \frac{4n}{P^{1/\omega_0} M^{1/2}} \right\rceil\right\}. \qquad (11.3)$$

The following subsection shows that this choice of $\ell$ is sufficient not to exceed the memory
capacity.


11.2.4.1  Memory  Requirements 

The extra memory requirement for a DFS step is the space to store one subproblem. Thus,
the extra space required at this step is 1/4 the space required to store A, B, and C. The
local memory requirement for the LM scheme is given by

$$\mathrm{Mem}_{LM}(n, P) = \frac{3n^2}{P} \sum_{i=0}^{\ell-1} \left(\frac{1}{4}\right)^i + \mathrm{Mem}_{UM}\left(\frac{n}{2^\ell}, P\right)$$
$$\le \frac{M}{3} \sum_{i=0}^{\ell-1} \left(\frac{1}{4}\right)^i + \frac{7 (n/2^\ell)^2}{P^{2/\omega_0}}$$
$$\le \frac{127}{144} M < M,$$

where the last line follows from (11.3) and (11.2). Thus, the limited memory scheme does
not exceed the available memory.
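As a numerical illustration of this bound, the sketch below computes the number of DFS steps from (11.3) and evaluates the LM memory requirement as the DFS storage plus the UM requirement on the reduced problem. It is only a sanity check in floating point: the function names are illustrative, the UM requirement is taken to be $7n^2/P^{2/\omega_0} - 4n^2/P$, and M is measured in words per processor.

```python
import math

W0 = math.log2(7)  # exponent of Strassen's algorithm


def dfs_steps(n, P, M):
    """Number of DFS steps chosen by (11.3)."""
    return max(0, math.ceil(math.log2(4 * n / (P ** (1 / W0) * math.sqrt(M)))))


def mem_um(n, P):
    """Memory per processor for the UM scheme on an n x n problem."""
    return 7 * n * n / P ** (2 / W0) - 4 * n * n / P


def mem_lm(n, P, M):
    """Memory per processor for the LM scheme: DFS storage plus UM on the reduced problem."""
    ell = dfs_steps(n, P, M)
    dfs_part = 3 * n * n / P * sum(0.25 ** i for i in range(ell))
    return dfs_part + mem_um(n / 2 ** ell, P)
```

For example, with n = 7168, P = 49, and M chosen so that (11.2) holds with equality, the scheme takes two DFS steps and uses a little over half of the available memory.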


11.2.4.2  Computational  Costs 


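As in the UM case, the computation required at a given DFS step is that of the local
additions and subtractions associated with computing the $S_i$ and $T_i$ and updating the output
matrix C with the $Q_i$. However, since all processors participate in each subproblem and the
subproblems are computed in sequence, the recurrence is given by

$$F_{LM}(n, P) = 15 \left(\frac{n^2}{4P}\right) + 7 \cdot F_{LM}\left(\frac{n}{2}, P\right).$$

After $\ell$ steps of DFS, the size of a subproblem is $\frac{n}{2^\ell} \times \frac{n}{2^\ell}$, and there are P processors involved.
We take k BFS steps to compute each of these $7^\ell$ subproblems. Thus

$$F_{LM}\left(\frac{n}{2^\ell}, P\right) = F_{UM}\left(\frac{n}{2^\ell}, P\right),$$

and

$$F_{LM}(n, P) = \frac{15n^2}{4P} \sum_{i=0}^{\ell-1} \left(\frac{7}{4}\right)^i + 7^\ell \cdot F_{UM}\left(\frac{n}{2^\ell}, P\right) = \frac{c_s n^{\omega_0} - 6n^2}{P} = \Theta\left(\frac{n^{\omega_0}}{P}\right).$$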


11.2.4.3  Communication  Costs 


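Since there are no communication costs associated with a DFS step, the recurrence is simply

$$W_{LM}(n, P) = 7 \cdot W_{LM}\left(\frac{n}{2}, P\right)$$
$$S_{LM}(n, P) = 7 \cdot S_{LM}\left(\frac{n}{2}, P\right)$$

with base cases

$$W_{LM}\left(\frac{n}{2^\ell}, P\right) = W_{UM}\left(\frac{n}{2^\ell}, P\right)$$
$$S_{LM}\left(\frac{n}{2^\ell}, P\right) = S_{UM}\left(\frac{n}{2^\ell}, P\right).$$

Thus the total communication costs are given by

$$W_{LM}(n, P) = 7^\ell \cdot W_{UM}\left(\frac{n}{2^\ell}, P\right) \le \frac{12 \cdot 4^{\omega_0 - 2} n^{\omega_0}}{P M^{\omega_0/2 - 1}} = \Theta\left(\frac{n^{\omega_0}}{P M^{\omega_0/2 - 1}}\right)$$
$$S_{LM}(n, P) = 7^\ell \cdot S_{UM}\left(\frac{n}{2^\ell}, P\right) \le \frac{(4n)^{\omega_0}}{P M^{\omega_0/2}} \, 24 \log_7 P = \Theta\left(\frac{n^{\omega_0}}{P M^{\omega_0/2}} \log P\right). \qquad (11.4)$$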


11.2.5  Communication  Optimality 

Theorem 11.1. CAPS has computational cost $\Theta\left(\frac{n^{\omega_0}}{P}\right)$, bandwidth cost

$$W = \Theta\left(\max\left\{\frac{n^{\omega_0}}{P M^{\omega_0/2 - 1}}, \frac{n^2}{P^{2/\omega_0}}\right\}\right),$$

and latency cost

$$S = \Theta\left(\max\left\{\frac{n^{\omega_0}}{P M^{\omega_0/2}} \log P, \log P\right\}\right).$$

Proof. In the case that $M \ge \mathrm{Mem}_{UM}(n, P) = \Theta\left(\frac{n^2}{P^{2/\omega_0}}\right)$, the UM scheme is possible. Then
the communication costs are given by (11.1), which matches the lower bound of Theorem 6.8.
Thus the UM scheme is communication-optimal (up to a logarithmic factor in the latency
cost and assuming that the data is initially distributed as described in Section 11.2.2). For
smaller values of M, the LM scheme must be used. Then the communication costs are
given by (11.4) and match the lower bound of Theorem 5.10, so the LM scheme is also
communication-optimal. □

By Theorems 5.10 and 6.8, we see that CAPS has optimal computational and bandwidth
costs, and that its latency cost is within a factor of log P of optimal.

We note that for the LM scheme, since both the computational and communication costs
are proportional to 1/P, we can expect perfect strong scaling: given a fixed problem size,
increasing the number of processors by some factor will decrease each cost by the same
factor. However, this strong scaling property has a limited range. As P increases, holding
everything else constant, the global memory size PM increases as well. The limit of perfect
strong scaling is exactly when there is enough memory for the UM scheme. See Section 6.2.2
for details.
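The limited range of perfect strong scaling can be seen by evaluating the two terms of the bandwidth cost in Theorem 11.1. The sketch below drops all constant factors, so the crossover value it reports is only an order-of-magnitude estimate and not the precise threshold of Section 6.2.2; `bandwidth_cost` and `scaling_limit` are illustrative names.

```python
import math

W0 = math.log2(7)  # exponent of Strassen's algorithm


def bandwidth_cost(n, P, M):
    """Asymptotic CAPS bandwidth cost from Theorem 11.1, constants omitted."""
    memory_dependent = n ** W0 / (P * M ** (W0 / 2 - 1))
    memory_independent = n ** 2 / P ** (2 / W0)
    return max(memory_dependent, memory_independent)


def scaling_limit(n, M):
    """Value of P at which the two terms are equal: P = (n^2 / M)^(w0 / 2)."""
    return (n * n / M) ** (W0 / 2)
```

Below the limit, doubling P halves the cost; above it, doubling P reduces the cost only by a factor of $2^{2/\omega_0} \approx 1.64$.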


11.3  Analysis  of  Other  Algorithms 

In this section we detail the asymptotic communication costs of other matrix multiplication
algorithms, both classical and Strassen-based. These communication costs and the
corresponding lower bounds are summarized in Table 11.1.

Many of the algorithms described in this section are hybrids of two different algorithms.
We use the convention that the names of the hybrid algorithms are composed of the names
of the two component algorithms, hyphenated. The first name describes the algorithm used
at the top level, on the largest problems, and the second describes the algorithm used at the
base level on smaller problems.

                          Flops                                    Bandwidth                                                   Latency
Classical
  Lower Bound [19, 95]    $n^3/P$                                  $\max\{n^3/(PM^{1/2}),\ n^2/P^{2/3}\}$                      $\max\{n^3/(PM^{3/2}),\ 1\}$
  2D [49, 71]             $n^3/P$                                  $n^2/P^{1/2}$                                               $P^{1/2}$
  3D [2, 35]              $n^3/P$                                  $n^2/P^{2/3}$                                               $\log P$
  2.5D (optimal) [137]    $n^3/P$                                  $\max\{n^3/(PM^{1/2}),\ n^2/P^{2/3}\}$                      $n^3/(PM^{3/2}) + \log P$
Strassen-based
  Lower Bound [19, 25]    $n^{\omega_0}/P$                         $\max\{n^{\omega_0}/(PM^{\omega_0/2-1}),\ n^2/P^{2/\omega_0}\}$  $\max\{n^{\omega_0}/(PM^{\omega_0/2}),\ 1\}$
  2D-Strassen [108]       $n^{\omega_0}/P^{(\omega_0-1)/2}$        $n^2/P^{1/2}$                                               $P^{1/2}$
  Strassen-2D [78, 108]   $(7/8)^\ell\, n^3/P$                     $(7/4)^\ell\, n^2/P^{1/2}$                                  $7^\ell P^{1/2}$
  2.5D-Strassen           $\max\{n^3/(PM^{3/2-\omega_0/2}),\ n^{\omega_0}/P^{\omega_0/3}\}$   $\max\{n^3/(PM^{1/2}),\ n^2/P^{2/3}\}$   $n^3/(PM^{3/2}) + \log P$

Table 11.1: Asymptotic matrix multiplication computational and communication costs of
algorithms and corresponding lower bounds. Here $\omega_0 = \lg 7 \approx 2.81$ is the exponent of
Strassen; $\ell$ is the number of Strassen steps taken. None of the Strassen-based algorithms
except for CAPS attain the lower bounds of Chapter 5 or Section 6.2.1.2; see Section 11.3
for a discussion of each.

11.3.1  Classical  Algorithms 

Classical algorithms must communicate asymptotically more than an optimal Strassen-based
algorithm. To compare the lower bounds, it is necessary to consider three cases for the
memory size: when the memory-dependent bounds dominate for both classical and Strassen,
when the memory-dependent bound dominates for classical, but the memory-independent
bound dominates for Strassen, and when the memory-independent bounds dominate for both
classical and Strassen. This analysis is detailed in [105, Appendix B]. Briefly, the factor by
which the classical bandwidth cost exceeds the Strassen bandwidth cost is $P^a$, where $a$ ranges
from $2/\omega_0 - 2/3 \approx 0.046$ to $(3 - \omega_0)/2 \approx 0.10$ depending on the relative problem size. The same sort of
analysis is used throughout Section 11.3 to compare each algorithm with the Strassen-based
lower bounds.
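The two endpoints of this range can be reproduced numerically from the bandwidth lower bounds in Table 11.1. The sketch below evaluates both bounds with constants omitted and returns the exponent $a$ such that their ratio equals $P^a$; `gap_exponent` is an illustrative name, and the smallest memory considered is $M = n^2/P$.

```python
import math

W0 = math.log2(7)  # exponent of Strassen's algorithm


def gap_exponent(n, P, M):
    """Exponent a such that (classical bound) / (Strassen bound) = P^a."""
    classical = max(n ** 3 / (P * math.sqrt(M)), n ** 2 / P ** (2 / 3))
    strassen = max(n ** W0 / (P * M ** (W0 / 2 - 1)), n ** 2 / P ** (2 / W0))
    return math.log(classical / strassen, P)
```

With minimal memory the memory-dependent bounds dominate and the exponent is $(3 - \omega_0)/2$; with very large memory the memory-independent bounds dominate and it drops to $2/\omega_0 - 2/3$.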

Various parallel classical matrix multiplication algorithms minimize communication relative
to the classical lower bounds for certain amounts of local memory M. For example,
Cannon’s algorithm [49] minimizes communication for $M = O(n^2/P)$. Several more practical
algorithms exist (such as SUMMA [71]) which use the same amount of local memory
and have the same asymptotic communication costs. We call this class of algorithms “2D”
because the communication patterns follow a two-dimensional processor grid.

Another class of algorithms, known as “3D” [35, 2] because the communication pattern
maps to a three-dimensional processor grid, uses more local memory and reduces communication
relative to 2D algorithms. This class of algorithms minimizes communication relative
to the classical lower bounds for $M = \Omega(n^2/P^{2/3})$. As shown in Section 6.2.1.1, it is not
possible to use more memory than $M = \Theta(n^2/P^{2/3})$ to reduce communication.

Recently, a more general algorithm has been developed which minimizes communication
in all cases. Because it reduces to 2D and 3D for the extreme values of M but interpolates
for the values between, it is known as the “2.5D” algorithm [137].

11.3.2  2D-Strassen 

One idea to parallelize Strassen-based algorithms is to use a 2D classical algorithm for the
inter-processor communication, and use the fast matrix multiplication algorithm locally [108].
We call such an algorithm “2D-Strassen”. It is straightforward to implement, but cannot
attain all the computational speedup from Strassen since it uses a classical algorithm for part
of the computation. In particular, it does not use Strassen for the largest matrices, when
Strassen provides the greatest reduction in computation. As a result, the computational
cost exceeds $O(n^{\omega_0}/P)$ by a factor of $P^{(3-\omega_0)/2} \approx P^{0.10}$. The 2D-Strassen algorithm has the
same communication cost as 2D algorithms, and hence does not match the communication
costs of CAPS. In comparing the 2D-Strassen bandwidth cost, $O(n^2/P^{1/2})$, to the CAPS
bandwidth cost in Section 11.2, note that for the problem to fit in memory we always have
$M = \Omega(n^2/P)$. The bandwidth cost exceeds that of CAPS by a factor of $P^a$, where $a$ ranges
from $(3 - \omega_0)/2 \approx 0.10$ to $2/\omega_0 - 1/2 \approx 0.21$, depending on the relative problem size. Similarly,
the latency cost, $O(P^{1/2})$, exceeds that of CAPS by a factor of $P^a$, where $a$ ranges from
$(3 - \omega_0)/2 \approx 0.10$ to $1/2 = 0.5$.

11.3.3  Strassen-2D 

The “Strassen-2D” algorithm applies $\ell$ DFS steps of Strassen’s algorithm at the top level,
and performs the $7^\ell$ smaller matrix multiplications using a 2D algorithm. By choosing
certain data layouts as in Section 11.2.2, it is possible to do the additions and subtractions
for Strassen’s algorithm without any communication [108]. However, Strassen-2D is also
unable to match the communication costs of CAPS. Moreover, the speedup of Strassen-2D
in computation comes at the expense of extra communication. For large numbers of Strassen
steps $\ell$, Strassen-2D can approach the computational lower bound of Strassen, but each step
increases the bandwidth cost by a factor of 7/4 and the latency cost by a factor of 7. Thus the
bandwidth cost of Strassen-2D is a factor of $(7/4)^\ell$ higher than 2D-Strassen, which is already
higher than that of CAPS. The latency cost is even worse: Strassen-2D is a factor of $7^\ell$
higher than 2D-Strassen.

One can reduce the latency cost of Strassen-2D at the expense of a larger memory footprint.
Since Strassen-2D runs a 2D algorithm $7^\ell$ times on the same set of processors, it is
possible to pack together messages from independent matrix multiplications. In the best
case, the latency cost is reduced to the cost of 2D-Strassen, which is still above that of
CAPS, at the expense of using a factor of $(7/4)^\ell$ more memory.
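The trade-off between computation and communication in Strassen-2D is easy to tabulate from the asymptotic costs in Table 11.1. The sketch below is a cost model with constants omitted, not a performance predictor; `strassen_2d_costs` is an illustrative name, and the message-packing optimization described above is not modeled.

```python
import math


def strassen_2d_costs(n, P, ell):
    """Asymptotic (flops, bandwidth, latency) of Strassen-2D with ell Strassen steps."""
    flops = (7 / 8) ** ell * n ** 3 / P
    bandwidth = (7 / 4) ** ell * n ** 2 / math.sqrt(P)
    latency = 7 ** ell * math.sqrt(P)
    return flops, bandwidth, latency
```

Each additional Strassen step multiplies the flop count by 7/8 but multiplies the bandwidth cost by 7/4 and the latency cost by 7, so the arithmetic savings are paid for in communication.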

11.3.4  2.5D-Strassen 

A natural idea is to replace a 2D classical algorithm in 2D-Strassen with the superior 2.5D
classical algorithm to obtain an algorithm we call 2.5D-Strassen. This algorithm uses the
2.5D algorithm for the inter-processor communication, and then uses Strassen for the local
computation. When $M = O(n^2/P)$, 2.5D-Strassen is exactly the same as 2D-Strassen, but
when there is extra memory it decreases both the communication cost and the
computational cost, since the local matrix multiplications are performed (using Strassen) on
larger matrices. To be precise, the computational cost exceeds the lower bound by a factor of
$P^a$, where $a$ ranges from $1 - \omega_0/3 \approx 0.064$ to $(3 - \omega_0)/2 \approx 0.10$ depending on the relative problem size.
The bandwidth cost exceeds the bandwidth cost of CAPS by a factor of $P^a$, where $a$ ranges
from $2/\omega_0 - 2/3 \approx 0.046$ to $(3 - \omega_0)/2 \approx 0.10$. In terms of latency, the cost of $\frac{n^3}{PM^{3/2}} + \log P$ exceeds
the latency cost of CAPS by a factor ranging from $\log P$ to $P^{(3-\omega_0)/2} \approx P^{0.10}$, depending on
the relative problem size.


11.3.5  Strassen-2.5D 


Similarly, by replacing a 2D algorithm with 2.5D in Strassen-2D, one obtains the new
algorithm we call Strassen-2.5D. First one takes $\ell$ DFS steps of Strassen, which can be done
without communication, and then one applies the 2.5D algorithm to each of the $7^\ell$ subproblems.
The computational cost is exactly the same as Strassen-2D, but the communication
cost will typically be lower. Each of the $7^\ell$ subproblems is multiplication of $n/2^\ell \times n/2^\ell$
matrices. Each subproblem uses only $1/4^\ell$ as much memory as the original problem. Thus there
may be a large amount of extra memory available for each subproblem, and the lower
communication costs of the 2.5D algorithm help. The choice of $\ell$ that minimizes the bandwidth
cost is

$$\ell_{opt} = \max\left\{0, \left\lceil \log_2 \frac{n}{M^{1/2} P^{1/3}} \right\rceil\right\}.$$

The same choice minimizes the latency cost. Note that when $M \ge \frac{n^2}{P^{2/3}}$, taking zero Strassen
steps minimizes the communication within the constraints of the Strassen-2.5D algorithm.
With $\ell = \ell_{opt}$, the bandwidth cost is a factor of $P^{1 - \omega_0/3} \approx P^{0.064}$ above that of CAPS.
Additionally, the computational cost is not optimal, and using $\ell = \ell_{opt}$, the computational
cost exceeds the optimal by a factor of $P^{1 - \omega_0/3} M^{3/2 - \omega_0/2} \approx P^{0.064} M^{0.096}$.
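The choice of the number of Strassen steps for Strassen-2.5D can be sketched as a one-line function. The rounding up to an integer is an assumption made here because the number of steps must be a whole number; `ell_opt` is an illustrative name and M is measured in words per processor.

```python
import math


def ell_opt(n, P, M):
    """Number of Strassen steps minimizing the Strassen-2.5D bandwidth cost."""
    return max(0, math.ceil(math.log2(n / (math.sqrt(M) * P ** (1 / 3)))))
```

When M is comfortably above $n^2/P^{2/3}$ the logarithm is negative and the function returns zero, which corresponds to running the plain 2.5D algorithm.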

It is also possible to take $\ell > \ell_{opt}$ steps of Strassen to decrease the computational
cost further. However, the decreased computational cost comes at the expense of higher
communication cost, as in the case of Strassen-2D. In particular, each extra step over $\ell_{opt}$
increases the bandwidth cost by a factor of 7/4 and the latency cost by a factor of 7. As
with Strassen-2D, it is possible to use extra memory to pack together messages from several
subproblems and decrease the latency cost, but not the bandwidth cost.

11.4  Performance  Results 

We  have  implemented  CAPS  using  MPI  on  a  Cray  XT4,  and  compared  it  to  various  previous 
classical  and  Strassen-based  algorithms.  The  benchmarking  data  is  shown  in  Figure  11.2. 

11.4.1  Experimental  Setup 

The nodes of the Cray XT4 have 8 GB of memory and a quad-core AMD “Budapest”
processor  running  at  2.3GHz.  We  treat  the  entire  node  as  a  single  processor,  and  when  we 
use  the  classical  algorithm  we  call  the  optimized  threaded  BLAS  in  Cray’s  LibSci  to  provide 
parallelism  between  the  four  cores  in  a  node.  The  peak  flop  rate  is  9.2  GFLOPS  per  core,  or 
36.8  GFLOPS  per  node.  The  machine  consists  of  9,572  nodes.  All  the  data  in  Figure  11.2 
is  for  multiplying  two  square  matrices  with  n  =  94080. 

11.4.2  Performance 

Note that the vertical scale of Figure 11.2 is “effective GFLOPS”, which is a useful measure
for comparing classical and fast matrix multiplication algorithms. It is calculated as

$$\text{Effective GFLOPS} = \frac{2n^3}{(\text{Execution time in seconds}) \cdot 10^9}. \qquad (11.5)$$

For classical algorithms, which perform $2n^3$ floating point operations, this gives the actual
GFLOPS. For fast matrix multiplication algorithms it gives the performance relative to
classical algorithms, but does not accurately represent the number of floating point operations
performed. For this problem size and number of Strassen steps taken, the actual number of
flops is about 45% of that of the classical algorithms.
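Equation (11.5) is straightforward to compute from a measured running time. The helper below is a minimal sketch with an illustrative name; it takes the matrix dimension and the wall-clock time and returns the effective rate for the whole machine, which can then be divided by the number of nodes.

```python
def effective_gflops(n, seconds):
    """Effective GFLOPS of an n x n matrix multiplication, per (11.5)."""
    return 2 * n ** 3 / (seconds * 1e9)
```

For instance, a multiplication with n = 1000 that takes two seconds runs at 1.0 effective GFLOPS, regardless of whether a classical or a Strassen-based algorithm was used.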

Our  algorithm  outperforms  all  previous  algorithms,  and  attains  performance  as  high  as 
49.1  effective  GFLOPS/node,  which  is  33%  above  the  theoretical  maximum  for  all  classical 
algorithms.  Compared  with  the  best  classical  implementation,  our  speedup  ranges  from 
51%  for  small  values  of  P  up  to  94%  when  using  most  of  the  machine.  Compared  with  the 
best previous parallel Strassen algorithms, our speedup ranges from 24% up to 184%. Unlike
previous  Strassen  algorithms,  we  are  able  to  attain  substantial  speedups  over  the  entire  range 
of  processors. 

11.4.3  Strong  Scaling 

Figure 11.2 is a strong scaling plot: the problem size is fixed and each algorithm is run
with P ranging from the minimum that provides enough memory up to the largest allowed
value of P smaller than the size of the machine. Perfect strong scaling corresponds to a
horizontal line in the plot. As the communication analysis predicts, CAPS exhibits better
strong scaling than any of the other algorithms (with the exception of ScaLAPACK, which
obtains very good strong scaling by having poor performance for small values of P).

Figure 11.2: Strong scaling performance of various matrix multiplication algorithms on a
Cray XT4 for fixed problem size n = 94080. The top line is CAPS as described in
Section 11.2, and substantially outperforms all the other classical and Strassen-based algorithms.
The horizontal axis is the number of nodes in log-scale. The vertical axis is effective GFLOPS,
which are a performance measure rather than a flop rate, as discussed in Section 11.4.2. See
Section 11.4.4 for a description of each implementation.

11.4.4  Details  of  the  Implementations 

11.4.4.1  CAPS 

This implementation is the CAPS algorithm, with a few modifications from the presentation
in Section 11.2. First, when computing locally it switches to classical matrix multiplication
below some size $n_0$. Second, it is generalized to run on $P = c \cdot 7^k$ processors for $c \in \{1, 2, 3, 6\}$
rather than just $7^k$ processors. As a result, the base-case classical matrix multiplication is
done on c processors rather than 1. Finally, the implementation uses the Winograd variant of
Strassen; see Section 2.4.2 for more details. Every point in the plot is tuned to use the best
interleaving pattern of BFS and DFS steps, and the best total number of Strassen steps. For
points in the figure, the optimal total number of Strassen steps is always 5 or 6.
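Whether a given processor count is usable by this generalized implementation can be decided by stripping factors of 7. The function below is a hypothetical helper written for illustration, not part of the CAPS code; it returns the pair (c, k) when $P = c \cdot 7^k$ with c in {1, 2, 3, 6}, and `None` otherwise.

```python
def decompose_processors(P):
    """Return (c, k) with P = c * 7**k and c in {1, 2, 3, 6}, or None if impossible."""
    k = 0
    while P % 7 == 0:
        P //= 7
        k += 1
    # After removing all factors of 7, the remainder must be one of the allowed c values.
    return (P, k) if P in (1, 2, 3, 6) else None
```

Because none of the allowed values of c is divisible by 7, the decomposition is unique whenever it exists.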

11.4.4.2  ScaLAPACK 

We  use  ScaLAPACK  [44]  as  optimized  by  Cray  in  LibSci.  This  is  an  implementation  of  the 
SUMMA  algorithm,  and  can  run  on  an  arbitrary  number  of  processors.  It  should  give  the 
best  performance  if  P  is  a  perfect  square  so  the  processors  can  be  placed  on  a  square  2D 
grid.  All  the  runs  shown  in  Figure  11.2  are  with  P  a  perfect  square. 

11.4.4.3  2.5D  Classical 

This is the code of [137]. It places the P processors in a grid of size $\sqrt{P/c} \times \sqrt{P/c} \times c$, and
requires that $\sqrt{P/c}$ and c are integers with $1 \le c \le P^{1/3}$, and c divides $\sqrt{P/c}$. Additionally,
it gets the best performance if c is as large as possible, given the constraint that c copies
of the input and output matrices fit in memory. In the case that c = 1 this code is an
optimized implementation of SUMMA. The values of P and c for the runs in Figure 11.2 are
chosen to get the best performance. The optimal permissible values of c ranged from 1 on
64 processors to 20 on 8000 processors.
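The constraints on c can be enumerated directly for a given processor count. The sketch below checks only the grid constraints stated above and ignores the memory constraint, which depends on n and on the node memory; `valid_replication_factors` is an illustrative name. Integer arithmetic is used throughout to avoid rounding problems with cube and square roots.

```python
import math


def valid_replication_factors(P):
    """All c with 1 <= c <= P^(1/3), sqrt(P/c) an integer, and c dividing sqrt(P/c)."""
    valid = []
    for c in range(1, P + 1):
        if c ** 3 > P:
            break
        if P % c:
            continue
        side = math.isqrt(P // c)
        if side * side == P // c and side % c == 0:
            valid.append(c)
    return valid
```

For P = 64 the permissible values are 1 and 4, and for P = 8000 they are 5 and 20, consistent with the extreme values of c reported for the runs.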

11.4.4.4  Strassen-2D 

Following the algorithm of [78, 108], this implementation uses the DFS code from the
implementation of CAPS at the top level, and then uses the optimized SUMMA code from the
2.5D  implementation  with  c  =  1.  Since  the  classical  code  requires  that  P  is  a  perfect  square, 
this  requirement  applies  here  as  well.  The  number  of  Strassen  steps  taken  is  tuned  to  give 
the  best  performance  for  each  P  value,  and  the  optimal  number  varies  from  0  to  2. 

11.4.4.5  2D-Strassen 

Following the algorithm of [108], the 2D-Strassen implementation is analogous to the
Strassen-2D implementation, but with the classical algorithm run before taking local Strassen steps.
Similarly,  the  same  code  is  used  for  local  Strassen  steps  here  and  in  our  implementation  of 
CAPS.  This  code  also  requires  that  P  is  a  perfect  square.  The  number  of  Strassen  steps  is 
tuned  for  each  P  value,  and  the  optimal  number  varies  from  0  to  3. 

11.4.4.6  2.5D-Strassen 

This  implementation  uses  the  2.5D  implementation  to  reduce  the  problem  to  one  processor, 
then  takes  several  Strassen  steps.  The  processor  requirements  are  the  same  as  for  the  2.5D 
implementation.  The  number  of  Strassen  steps  is  tuned  for  each  number  of  processors,  and 
the  optimal  number  varies  from  0  to  3.  We  also  tested  the  Strassen-2.5D  algorithm,  but  its 
performance  was  always  lower  than  2.5D-Strassen  in  our  experiments. 

11.5 Conclusions and Future Work

11.5.1  Stability  of  Fast  Matrix  Multiplication 

CAPS  has  the  same  stability  properties  as  sequential  versions  of  Strassen.  For  discussion  of 
the  stability  of  fast  matrix  multiplication  algorithms,  see  [59,  85].  We  highlight  a  few  main 
points here. The tightest error bounds for classical matrix multiplication are component-wise:
$|C - \hat{C}| \le n\epsilon |A| \cdot |B|$, where $\hat{C}$ is the computed result and $\epsilon$ is the machine precision.
Strassen and other fast algorithms do not satisfy component-wise bounds but do satisfy
the slightly weaker norm-wise bounds: $\|C - \hat{C}\| \le f(n)\epsilon \|A\| \|B\|$, where $\|A\| = \max_{i,j} |A_{ij}|$
and $f$ is polynomial in $n$ [85]. Accuracy can be improved with the use of diagonal scaling
matrices: $D_1 C D_3 = D_1 A D_2 \cdot D_2^{-1} B D_3$. It is possible to choose $D_1, D_2, D_3$ so that the error
bounds satisfy either $|C_{ij} - \hat{C}_{ij}| \le f(n)\epsilon \|A(i,:)\| \|B(:,j)\|$ or $\|C - \hat{C}\| \le f(n)\epsilon \| |A| \cdot |B| \|$.
By  scaling,  the  error  bounds  on  Strassen  become  comparable  to  those  of  many  other  dense 
linear  algebra  algorithms,  such  as  LU  and  QR  decomposition  [58].  Thus  using  Strassen  for 
the  matrix  multiplications  in  a  larger  computation  will  often  not  harm  the  stability  at  all. 
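
The diagonal scaling identity can be verified directly. The short derivation below assumes only that the three diagonal scaling matrices are nonsingular; it shows that the inner scaling cancels and that $C$ is recovered by undoing the outer scaling.

```latex
\begin{align*}
(D_1 A D_2)\,(D_2^{-1} B D_3)
  &= D_1 A \,(D_2 D_2^{-1})\, B D_3
   = D_1 (AB) D_3
   = D_1 C D_3, \\
C &= D_1^{-1} \bigl[ (D_1 A D_2)\,(D_2^{-1} B D_3) \bigr] D_3^{-1}.
\end{align*}
```

If the diagonal entries are chosen as powers of 2, the scaling and unscaling steps are exact in binary floating point (barring overflow and underflow), so rounding error would arise only in the fast multiplication of the scaled operands.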

11.5.2  Hardware  Scaling 

Although Strassen performs asymptotically less computation and communication than classical
matrix multiplication, it is more demanding on the hardware. That is, if one wants
to  run  matrix  multiplication  near  the  peak  CPU  speed,  Strassen  is  more  demanding  of  the 
memory  size  and  communication  bandwidth.  This  is  because  the  ratio  of  computational  cost 
to bandwidth cost is lower for Strassen than for classical. From the lower bound in Theorem
5.10, the asymptotic ratio of computational cost to bandwidth cost is $M^{\omega_0/2-1}$ for Strassen-based
algorithms, versus $M^{1/2}$ for classical algorithms. This means that it is harder to run
Strassen near peak than it is to run classical matrix multiplication near peak. In terms of
the machine parameters $\beta$ and $\gamma$ introduced in Section 2.1.3, the condition to be able to be
computation-bound is $\gamma M^{1/2} \ge c\beta$ for classical matrix multiplication and $\gamma M^{\omega_0/2-1} \ge d\beta$
for Strassen. Here $c$ and $d$ are constants that depend on the constants in the communication
and  computational  costs  of  classical  and  Strassen-based  matrix  multiplication. 

The  above  inequalities  may  guide  hardware  design  as  long  as  classical  and  Strassen  matrix 
multiplication  are  considered  important  computations.  They  apply  both  to  the  distributed 
case, where $M$ is the local memory size and $\beta$ is the inverse network bandwidth, and to the
sequential/shared-memory case where $M$ is the cache size and $\beta$ is the inverse memory-cache
bandwidth. 
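
To make the hardware conditions concrete, one can solve each inequality for the smallest memory size M that permits computation-bound execution: M >= (c*beta/gamma)^2 for classical and M >= (d*beta/gamma)^(1/(omega_0/2 - 1)) for Strassen. The sketch below is illustrative only. The function names are hypothetical, the constants c and d default to 1 because their actual values are not given here, and omega_0 is taken to be log2(7) for Strassen.

```python
import math

OMEGA_STRASSEN = math.log2(7)  # omega_0 for Strassen's algorithm, about 2.807


def min_memory_classical(beta, gamma, c=1.0):
    """Smallest M satisfying gamma * M**(1/2) >= c * beta."""
    return (c * beta / gamma) ** 2


def min_memory_strassen(beta, gamma, d=1.0, omega0=OMEGA_STRASSEN):
    """Smallest M satisfying gamma * M**(omega0/2 - 1) >= d * beta."""
    return (d * beta / gamma) ** (1.0 / (omega0 / 2.0 - 1.0))


if __name__ == "__main__":
    # Hypothetical machine: moving one word costs 100 times as much as one flop.
    beta, gamma = 100.0, 1.0
    print("classical:", min_memory_classical(beta, gamma))
    print("Strassen: ", min_memory_strassen(beta, gamma))
```

Because omega_0/2 - 1 is smaller than 1/2, the Strassen exponent 1/(omega_0/2 - 1) exceeds 2, so for any beta/gamma greater than 1 (with equal constants) Strassen needs a larger M than classical to be computation-bound.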

11.5.3  Optimizing  On-Node  Performance 

Note  that  although  our  implementation  performs  above  the  classical  peak  performance,  it 
performs  well  below  the  corresponding  Strassen-Winograd  peak,  defined  by  the  time  it  takes 
to perform $c_s n^{\omega_0}/P$ flops at the peak speed of each processor. To some extent, this is because
Strassen is more demanding on the hardware, as noted above. However, we have not yet analyzed
whether the amount by which our performance falls below the Strassen peak can be entirely accounted
for  based  on  machine  parameters.  It  is  also  possible  that  a  high  performance  shared-memory 
Strassen  implementation  might  provide  substantial  speedups  for  our  implementation,  as  well 
as  for  2D-Strassen  and  2.5D-Strassen. 

11.5.4  Parallelizing  Other  Algorithms 

11.5.4.1  Another  Optimal  Classical  Algorithm 

We  can  apply  our  parallelization  approach  to  recursive  classical  matrix  multiplication  to 
obtain a communication-optimal algorithm. This algorithm has the same asymptotic communication
costs as the 2.5D algorithm [137]. We observed comparable performance to the
2.5D  algorithm  on  our  experimental  platform.  As  with  CAPS,  this  algorithm  has  not  been 
optimized  for  contention,  whereas  the  2.5D  algorithm  is  very  well  optimized  for  contention 
on  torus  networks. 

11.5.4.2  Other  Fast  Matrix  Multiplication  Algorithms 

Our  approach  of  executing  a  recursive  algorithm  in  parallel  by  traversing  the  recursion  tree 
in  DFS  (sequential)  or  BFS  (parallel)  manners  is  not  limited  to  Strassen’s  algorithm.  All  fast 
square matrix multiplication algorithms are built out of ways to multiply $n_0 \times n_0$ matrices
using $q < n_0^3$ multiplications. As with Strassen and Strassen-Winograd, they compute $q$
linear combinations of entries of each of $A$ and $B$, multiply these pairwise, then compute the
entries of $C$ as linear combinations of these.² CAPS can be easily generalized to any such
multiplication, with the following modifications:

•  The  number  of  processors  P  is  a  power  of  q. 

• The data layout must be such that all $n_0^2$ blocks of $A$, $B$, and $C$ are distributed equally
among the $P$ processors with the same layout.

• The BFS and DFS steps determine whether the $q$ multiplications are performed in parallel
or sequentially.

The communication costs are then exactly as above, but with $\omega_0 = \log_{n_0} q$.
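
As a small illustration of the two quantities this generalization depends on, the sketch below computes the exponent omega_0 = log_{n0} q for a given base case and checks whether a processor count P is a power of q. The function names are hypothetical, and reading the returned exponent k as the number of BFS steps that reduce P = q^k processors to one processor per subproblem is our interpretation rather than a statement from the text.

```python
import math


def omega0(n0, q):
    """Exponent of a fast algorithm that multiplies n0 x n0 matrices with q multiplications."""
    if n0 < 2 or q < 1:
        raise ValueError("need n0 >= 2 and q >= 1")
    return math.log(q) / math.log(n0)


def power_exponent(P, q):
    """Return k if P == q**k for some integer k >= 0, otherwise None."""
    if P < 1 or q < 2:
        return None
    k = 0
    while P > 1 and P % q == 0:
        P //= q
        k += 1
    return k if P == 1 else None


if __name__ == "__main__":
    print(omega0(2, 7))           # Strassen base case: log2(7)
    print(omega0(2, 8))           # classical 2 x 2 base case
    print(power_exponent(49, 7))  # 49 processors with q = 7
    print(power_exponent(64, 7))  # 64 is not a power of 7
```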

It is unclear whether any of the faster matrix multiplication algorithms are useful in
practice. One reason is that the fastest algorithms are not explicit. Rather, there are
non-constructive proofs that the algorithms exist. To implement these algorithms, they would
have to be found, which appears to be a difficult problem. The generalization of CAPS
described in this section does apply to all of them, so we have proved the existence of a
communication-avoiding non-explicit parallel algorithm corresponding to every fast matrix
multiplication algorithm. We conjecture that the algorithms are all communication optimal
(i.e., they attain a known lower bound), but that is not yet proved since the lower bound
proofs in Chapter 5 and Section 6.2.1.2 may not apply to all fast matrix multiplication
algorithms. In cases where the lower bounds do apply, they match the performance of the
generalization of CAPS, and so the algorithms are communication optimal.

² By [124], all fast matrix multiplication algorithms can be expressed in this bilinear form.

Chapter 12
Conclusion 


In  this  thesis,  we  have  considered  the  fundamental  operations  in  dense  linear  algebra.  We 
have  seen  that,  even  for  a  fixed  computation,  many  algorithms  exist  that  exhibit  a  range  of 
communication  costs  when  analyzed  on  simple  sequential  and  parallel  models.  By  proving 
lower  bounds  on  the  communication  required  for  particular  computations  and  developing 
more communication-efficient algorithms, we are able to devise optimal algorithms on state-of-the-art
hardware and observe better performance than existing approaches. Extending the
algorithmic  ideas  and  techniques  to  computations  outside  of  dense  linear  algebra  can  benefit 
many  more  application  areas.  Even  in  scientific  computing,  other  “dwarves”  include  spectral 
methods  (FFTs),  sparse  linear  algebra,  and  graph  algorithms,  and  communication-avoiding 
ideas  have  already  proved  effective  for  these  areas  [36,  114,  136].  We  believe  similar  ideas 
can  and  will  benefit  an  even  larger  range  of  computations  and  applications  in  the  future. 

We  now  briefly  summarize  our  contributions  and  list  possible  directions  for  future  research 
within  the  realm  of  dense  linear  algebra.  In  this  work,  we  prove  communication  lower  bounds 
for dense and sparse, sequential and parallel algorithms for a general set of classical $\Theta(n^3)$
linear  algebra  computations  (Chapters  3  and  4).  We  also  prove  communication  lower  bounds 
for  Strassen’s  and  Strassen-like  fast  matrix  multiplication  algorithms  (Chapters  5  and  6)  and 
establish  a  separate  set  of  lower  bounds  that  apply  to  parallel  algorithms  and  set  limits  on  an 
algorithm’s  ability  to  perfectly  strong  scale  (Chapter  6).  In  Chapters  7  and  8,  we  summarize 
the  state-of-the-art  in  communication  efficiency  for  both  sequential  and  parallel  algorithms 
for  the  fundamental  computations  to  which  the  lower  bounds  apply.  Finally,  we  present  new 
algorithms  for  three  particular  computations:  computing  a  symmetric-indefinite  factorization 
(Chapter  9),  reducing  a  symmetric  band  matrix  to  tridiagonal  form  via  orthogonal  similarity 
transformations  (Chapter  10),  and  parallelizing  Strassen’s  matrix  multiplication  algorithm 
(Chapter  11). 

While  there  are  many  open  problems  described  throughout  the  previous  chapters,  and 
developing optimized implementations on the most current hardware is an ongoing challenge,
we  highlight  here  only  a  few  areas  of  possible  future  algorithmic  research: 

• developing communication-optimal extra-memory algorithms for all of the computations
discussed in Chapter 8, which likely includes tightening lower bounds;

•  extending  algorithmic  techniques  for  dense  matrices  to  sparse  direct  methods,  where 
computations  are  more  irregular  and  also  communication  bound; 

•  finding  and  parallelizing  other  fast,  practical  matrix  multiplication  algorithms  that  are 
asymptotically  superior  to  Strassen’s  algorithm  but  also  can  be  implemented  to  be 
faster  in  practice;  and 

• developing parallel fast linear algebra computations (e.g., for solving linear systems or
least squares problems), that is, parallelizing those algorithms that rely on fast matrix
multiplication subroutines and are already communication optimal in the sequential case.

As mentioned in Chapter 1, the costs of communication relative to computation are growing,
and increased parallelism (on and off chip) means that careful consideration of data
movement  is  necessary  to  achieve  satisfactory  performance  on  today’s  and  future  machines. 
Not  only  is  communication  becoming  more  important,  it  is  also  becoming  more  difficult  to 
reason  about:  processors  are  growing  more  complex,  with  heterogeneous  computational  units 
and the ability to vary clock speeds with time, to name two examples. Developing efficient algorithms
and optimized implementations, within linear algebra and in other domains, requires
both  strong  theoretical  foundations  and  continual  adaptation  to  new  architectures.  Beyond 
the  particular  results  in  this  thesis,  the  performance  analysis  and  lower  bound  techniques 
establish  a  way  of  thinking  about  algorithm  design  that  will  be  increasingly  important  as 
machines  evolve. 